Preface


I have been teaching probability and mathematical statistics to graduate students for 
close to 50 years. In my career I realized that the most difficult task for students is 
solving problems. Bright students can generally grasp the theory more easily than they
can apply it. To overcome this hurdle, I used to write examples of solutions to problems
and hand them to my students. I often wrote examples for the students based on my published
research. Over the years I have accumulated a large number of such examples and 
problems. This book is aimed at sharing these examples and problems with the 
population of students, researchers, and teachers. 

The book consists of nine chapters. Each chapter has four parts. The first part 
contains a short presentation of the theory. This is required especially for establishing 
notation and to provide a quick overview of the important results and references. The 
second part consists of examples. The examples follow the theoretical presentation. 
The third part consists of problems for solution, arranged by the corresponding
sections of the theory part. The fourth part presents solutions to some selected problems.
The solutions are generally not as detailed as the examples, but as such these are
examples of solutions. I tried to demonstrate how to apply known results in order to
solve problems elegantly. Altogether, the book contains 167 examples and 431
problems.

The emphasis in the book is on statistical inference. The first chapter on
probability is especially important for students who have not had a course on advanced
probability. Chapter Two is on the theory of distribution functions. This is basic to
all developments in the book, and from my experience, it is important for all students
to master this calculus of distributions. The chapter covers multivariate distributions,
especially the multivariate normal; conditional distributions; techniques of
determining variances and covariances of sample moments; the theory of exponential families;
Edgeworth expansions and saddle-point approximations; and more. Chapter Three
covers the theory of sufficient statistics, completeness of families of distributions,
and the information in samples. In particular, it presents the Fisher information, the
Kullback-Leibler information, and the Hellinger distance. Chapter Four provides a
strong foundation in the theory of testing statistical hypotheses. The Wald SPRT is
discussed there too. Chapter Five is focused on optimal point estimation of
different kinds. Pitman estimators and equivariant estimators are also discussed.
Chapter Six covers problems of efficient confidence intervals, in particular the problem of
determining fixed-width confidence intervals by two-stage or sequential sampling.
Chapter Seven covers techniques of large sample approximations, useful in
estimation and testing. Chapter Eight is devoted to Bayesian analysis, including empirical
Bayes theory. It highlights computational approximations by numerical analysis and
simulations. Finally, Chapter Nine presents a few more advanced topics, such as
minimaxity, admissibility, structural distributions, and the Stein-type estimators.

I would like to acknowledge with gratitude the contributions of my many former
students, who toiled through these examples and problems and gave me their
important feedback. In particular, I am very grateful and indebted to my colleagues,
Professors A. Schick, Q. Yu, S. De, and A. Polunchenko, who carefully read parts 
of this book and provided important comments. Mrs. Marge Pratt skillfully typed 
several drafts of this book with patience and grace. To her I extend my heartfelt 
thanks. Finally, I would like to thank my wife Hanna for giving me the conditions 
and encouragement to do research and engage in scholarly writing. 


SHELEMYAHU ZACKS 
CHAPTER 1


Basic Probability Theory 


PART I: THEORY 


It is assumed that the reader has had a course in elementary probability. In this chapter 
we discuss more advanced material, which is required for further developments. 


1.1 OPERATIONS ON SETS 


Let $S$ denote a sample space. Let $E_1$, $E_2$ be subsets of $S$. We denote the union by
$E_1 \cup E_2$ and the intersection by $E_1 \cap E_2$. $\bar{E} = S - E$ denotes the complement of
$E$. By De Morgan's laws, $\overline{E_1 \cup E_2} = \bar{E}_1 \cap \bar{E}_2$ and
$\overline{E_1 \cap E_2} = \bar{E}_1 \cup \bar{E}_2$.

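Given a sequence of sets $\{E_n, n \ge 1\}$ (finite or infinite), we define

\[
\sup_{n \ge 1} E_n = \bigcup_{n \ge 1} E_n, \qquad \inf_{n \ge 1} E_n = \bigcap_{n \ge 1} E_n. \tag{1.1.1}
\]

Furthermore, $\liminf_{n \to \infty}$ and $\limsup_{n \to \infty}$ are defined as

\[
\liminf_{n \to \infty} E_n = \bigcup_{n \ge 1} \bigcap_{k \ge n} E_k, \qquad
\limsup_{n \to \infty} E_n = \bigcap_{n \ge 1} \bigcup_{k \ge n} E_k. \tag{1.1.2}
\]

If a point of $S$ belongs to $\limsup_{n \to \infty} E_n$, it belongs to infinitely many sets $E_n$. The sets
$\liminf_{n \to \infty} E_n$ and $\limsup_{n \to \infty} E_n$ always exist and

\[
\liminf_{n \to \infty} E_n \subset \limsup_{n \to \infty} E_n. \tag{1.1.3}
\]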


If $\liminf_{n \to \infty} E_n = \limsup_{n \to \infty} E_n$, we say that a limit of $\{E_n, n \ge 1\}$ exists. In this case,

\[
\lim_{n \to \infty} E_n = \liminf_{n \to \infty} E_n = \limsup_{n \to \infty} E_n. \tag{1.1.4}
\]
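
The definitions (1.1.2) can be tried out on a small case. The following is a minimal sketch, assuming a hypothetical alternating sequence of two finite sets; since the sequence is periodic, truncating it at a finite horizon should not change the answer, but the truncation is an assumption of the sketch and not part of the definitions.

```python
from functools import reduce

def lim_sup(sets, horizon):
    """Intersection over n < horizon of the tail unions of sets[n:]."""
    tails = (reduce(set.union, sets[n:]) for n in range(horizon))
    return reduce(set.intersection, tails)

def lim_inf(sets, horizon):
    """Union over n < horizon of the tail intersections of sets[n:]."""
    tails = (reduce(set.intersection, sets[n:]) for n in range(horizon))
    return reduce(set.union, tails)

# Hypothetical alternating sequence: E_n = {1, 2} for even n, {2, 3} for odd n.
E = [{1, 2} if n % 2 == 0 else {2, 3} for n in range(20)]

# Every tail sets[n:], n < 10, still contains both kinds of set.
sup_E = lim_sup(E, horizon=10)
inf_E = lim_inf(E, horizon=10)
print(sup_E, inf_E)
```

Here the point 2 lies in every set, the points 1 and 3 each lie in infinitely many sets, and the two limits differ, so this sequence has no limit in the sense of (1.1.4).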


A sequence $\{E_n, n \ge 1\}$ is called monotone increasing if $E_n \subset E_{n+1}$ for all $n \ge 1$. In
this case $\lim_{n \to \infty} E_n = \bigcup_{n=1}^{\infty} E_n$. The sequence is monotone decreasing if
$E_n \supset E_{n+1}$ for all $n \ge 1$. In this case $\lim_{n \to \infty} E_n = \bigcap_{n=1}^{\infty} E_n$.
We conclude this section with the definition of a partition of the sample space. A collection of
sets $\mathcal{D} = \{E_1, \ldots, E_k\}$ is called a finite partition of $S$ if all elements of
$\mathcal{D}$ are pairwise disjoint and their union is $S$, i.e., $E_i \cap E_j = \emptyset$ for all
$i \ne j$; $E_i, E_j \in \mathcal{D}$; and $\bigcup_{i=1}^{k} E_i = S$. If $\mathcal{D}$ contains a
countable number of sets that are mutually exclusive and $\bigcup_{i=1}^{\infty} E_i = S$, we say that
$\mathcal{D}$ is a countable partition.


1.2 ALGEBRA AND σ-FIELDS


Let $S$ be a sample space. An algebra $\mathcal{A}$ is a collection of subsets of $S$ satisfying

(i) $S \in \mathcal{A}$;
(ii) if $E \in \mathcal{A}$ then $\bar{E} \in \mathcal{A}$;                                   (1.2.1)
(iii) if $E_1, E_2 \in \mathcal{A}$ then $E_1 \cup E_2 \in \mathcal{A}$.

We consider $\emptyset = \bar{S}$. Thus, (i) and (ii) imply that $\emptyset \in \mathcal{A}$. Also, if
$E_1, E_2 \in \mathcal{A}$ then $E_1 \cap E_2 \in \mathcal{A}$.

The trivial algebra is $\mathcal{A}_0 = \{\emptyset, S\}$. An algebra $\mathcal{A}_1$ is a subalgebra of
$\mathcal{A}_2$ if all sets of $\mathcal{A}_1$ are contained in $\mathcal{A}_2$. We denote this inclusion by
$\mathcal{A}_1 \subset \mathcal{A}_2$. Thus, the trivial algebra $\mathcal{A}_0$ is a subalgebra of every
algebra $\mathcal{A}$. We will denote by $\mathcal{A}(S)$ the algebra generated by all subsets of $S$
(see Example 1.1).

If a sample space $S$ has a finite number of points $n$, say $1 \le n < \infty$, then the
collection of all subsets of $S$ is called the discrete algebra generated by the elementary
events of $S$. It contains $2^n$ events.

Let $\mathcal{D}$ be a partition of $S$ having $k$, $2 \le k$, disjoint sets. Then, the algebra generated
by $\mathcal{D}$, $\mathcal{A}(\mathcal{D})$, is the algebra containing all the $2^k - 1$ unions of the
elements of $\mathcal{D}$ and the empty set.
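
For a small finite sample space, the algebra generated by a partition can be enumerated directly. The sketch below is illustrative only: the sample space `S`, the partition `D`, and the helper names are assumptions made for this example. It builds all unions of blocks (the empty union giving the empty set) and checks conditions (i)-(iii) of (1.2.1) by brute force.

```python
from itertools import combinations

def generated_algebra(partition):
    """All unions of blocks of the partition, the empty union included."""
    blocks = [frozenset(b) for b in partition]
    algebra = set()
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            algebra.add(frozenset().union(*chosen))
    return algebra

def is_algebra(collection, sample_space):
    """Check conditions (i)-(iii) of (1.2.1) by exhaustive search."""
    S = frozenset(sample_space)
    return (
        S in collection
        and all(S - E in collection for E in collection)
        and all(E1 | E2 in collection for E1 in collection for E2 in collection)
    )

# Hypothetical sample space and a partition of it into k = 3 blocks.
S = {1, 2, 3, 4, 5, 6}
D = [{1, 2}, {3}, {4, 5, 6}]
A_D = generated_algebra(D)
print(len(A_D), is_algebra(A_D, S))
```

With $k = 3$ blocks the enumeration yields $2^3 = 8$ sets: the $2^3 - 1$ nonempty unions plus the empty set.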
An algebra on $S$ is called a σ-field if, in addition to being an algebra, the following
holds.

(iv) If $E_n \in \mathcal{A}$, $n \ge 1$, then $\bigcup_{n=1}^{\infty} E_n \in \mathcal{A}$.

We will denote a σ-field by $\mathcal{F}$. In a σ-field $\mathcal{F}$ the supremum, infimum, limsup, and
liminf of any sequence of events belong to $\mathcal{F}$. If $S$ is finite, the discrete algebra
$\mathcal{A}(S)$ is a σ-field. In Example 1.3 we show an algebra that is not a σ-field.

The minimal σ-field containing the algebra generated by $\{(-\infty, x], -\infty < x < \infty\}$
is called the Borel σ-field on the real line $\mathbb{R}$.

A sample space $S$, with a σ-field $\mathcal{F}$, $(S, \mathcal{F})$ is called a measurable space.

The following lemmas establish the existence of a smallest σ-field containing a
given collection of sets.


Lemma 1.2.1. Let $\mathcal{E}$ be a collection of subsets of a sample space $S$. Then, there
exists a smallest σ-field $\mathcal{F}(\mathcal{E})$, containing the elements of $\mathcal{E}$.


Proof. The algebra of all subsets of $S$, $\mathcal{A}(S)$, obviously contains all elements of $\mathcal{E}$.
Similarly, the σ-field $\mathcal{F}$ containing all subsets of $S$ contains all elements of $\mathcal{E}$. Define
the σ-field $\mathcal{F}(\mathcal{E})$ to be the intersection of all σ-fields that contain all elements of
$\mathcal{E}$. Obviously, $\mathcal{F}(\mathcal{E})$ is a σ-field, since each defining property is inherited by
the intersection, and it is contained in every σ-field that contains $\mathcal{E}$. QED


A collection $\mathcal{M}$ of subsets of $S$ is called a monotonic class if the limit of any
monotone sequence in $\mathcal{M}$ belongs to $\mathcal{M}$.

If $\mathcal{E}$ is a collection of subsets of $S$, let $\mathcal{M}^*(\mathcal{E})$ denote the smallest
monotonic class containing $\mathcal{E}$.


Lemma 1.2.2. A necessary and sufficient condition for an algebra $\mathcal{A}$ to be a σ-field
is that it is a monotonic class.


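Proof. (i) Obviously, if $\mathcal{A}$ is a σ-field, it is a monotonic class.
(ii) Let $\mathcal{A}$ be a monotonic class.

Let $E_n \in \mathcal{A}$, $n \ge 1$. Define $B_n = \bigcup_{i=1}^{n} E_i$. Obviously $B_n \subset B_{n+1}$
for all $n \ge 1$. Hence $\lim_{n \to \infty} B_n = \bigcup_{n=1}^{\infty} B_n \in \mathcal{A}$. But
$\bigcup_{n=1}^{\infty} B_n = \bigcup_{n=1}^{\infty} E_n$. Thus, $\bigcup_{n=1}^{\infty} E_n \in \mathcal{A}$.
Similarly, $\bigcap_{n=1}^{\infty} E_n \in \mathcal{A}$.

Thus, $\mathcal{A}$ is a σ-field. QED


Theorem 1.2.1. Let $\mathcal{A}$ be an algebra. Then $\mathcal{M}^*(\mathcal{A}) = \mathcal{F}(\mathcal{A})$,
where $\mathcal{F}(\mathcal{A})$ is the smallest σ-field containing $\mathcal{A}$.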
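Proof. See Shiryayev (1984, p. 139).

The measurable space $(\mathbb{R}, \mathcal{B})$, where $\mathbb{R}$ is the real line and
$\mathcal{B} = \mathcal{F}(\mathbb{R})$, called the Borel measurable space, plays a most important role
in the theory of statistics. Another important measurable space is $(\mathbb{R}^n, \mathcal{B}^n)$,
$n \ge 2$, where $\mathbb{R}^n = \mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R}$ is the Euclidean
$n$-space, and $\mathcal{B}^n = \mathcal{B} \times \cdots \times \mathcal{B}$ is the smallest σ-field containing
$\mathbb{R}^n$, $\emptyset$, and all $n$-dimensional rectangles $I = I_1 \times \cdots \times I_n$, where

\[
I_i = (a_i, b_i], \quad i = 1, \ldots, n, \quad -\infty < a_i < b_i < \infty.
\]

The measurable space $(\mathbb{R}^{\infty}, \mathcal{B}^{\infty})$ is used as a basis for probability models of
experiments with infinitely many trials. $\mathbb{R}^{\infty}$ is the space of ordered sequences
$\mathbf{x} = (x_1, x_2, \ldots)$, $-\infty < x_n < \infty$, $n = 1, 2, \ldots$. Consider the cylinder sets

\[
C(I_1 \times \cdots \times I_n) = \{\mathbf{x} : x_i \in I_i, \ i = 1, \ldots, n\}
\]

and

\[
C(B_1 \times \cdots \times B_n) = \{\mathbf{x} : x_i \in B_i, \ i = 1, \ldots, n\}
\]

where $B_i$ are Borel sets, i.e., $B_i \in \mathcal{B}$. The smallest σ-field containing all these cylinder
sets, $n \ge 1$, is $\mathcal{B}(\mathbb{R}^{\infty})$. Examples of Borel sets in $\mathcal{B}(\mathbb{R}^{\infty})$ are

(a) $\{\mathbf{x} : \mathbf{x} \in \mathbb{R}^{\infty}, \ \sup_{n \ge 1} x_n > a\}$

or

(b) $\{\mathbf{x} : \mathbf{x} \in \mathbb{R}^{\infty}, \ \limsup_{n \to \infty} x_n \le a\}$.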

1.3 PROBABILITY SPACES 


Given a measurable space $(S, \mathcal{F})$, a probability model ascribes a countably additive
function $P$ on $\mathcal{F}$, which assigns a probability $P\{A\}$ to all sets $A \in \mathcal{F}$. This function
should satisfy the following properties.

(A.1) If $A \in \mathcal{F}$ then $0 \le P\{A\} \le 1$.
(A.2) $P\{S\} = 1$.                                                             (1.3.1)
(A.3) If $\{E_n, n \ge 1\} \subset \mathcal{F}$ is a sequence of disjoint sets in $\mathcal{F}$, then

\[
P\left\{\bigcup_{n=1}^{\infty} E_n\right\} = \sum_{n=1}^{\infty} P\{E_n\}. \tag{1.3.2}
\]

Recall that if $A \subset B$ then $P\{A\} \le P\{B\}$, and $P\{\bar{A}\} = 1 - P\{A\}$. Other properties
will be given in the examples and problems. In the sequel we often write $AB$ for
$A \cap B$.


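Theorem 1.3.1. Let $(S, \mathcal{F}, P)$ be a probability space, where $\mathcal{F}$ is a σ-field of subsets
of $S$ and $P$ a probability function. Then

(i) if $B_n \subset B_{n+1}$, $n \ge 1$, $B_n \in \mathcal{F}$, then

\[
P\left\{\lim_{n \to \infty} B_n\right\} = \lim_{n \to \infty} P\{B_n\}. \tag{1.3.3}
\]

(ii) if $B_n \supset B_{n+1}$, $n \ge 1$, $B_n \in \mathcal{F}$, then

\[
P\left\{\lim_{n \to \infty} B_n\right\} = \lim_{n \to \infty} P\{B_n\}. \tag{1.3.4}
\]

Proof. (i) Since $B_n \subset B_{n+1}$, $\lim_{n \to \infty} B_n = \bigcup_{n=1}^{\infty} B_n$. Moreover,

\[
P\left\{\bigcup_{n=1}^{\infty} B_n\right\} = P\{B_1\} + \sum_{n=2}^{\infty} P\{B_n - B_{n-1}\}. \tag{1.3.5}
\]

Notice that for $n \ge 2$, since $\bar{B}_n B_{n-1} = \emptyset$,

\[
P\{B_n - B_{n-1}\} = P\{B_n \bar{B}_{n-1}\}
= P\{B_n\} - P\{B_n B_{n-1}\} = P\{B_n\} - P\{B_{n-1}\}. \tag{1.3.6}
\]

Also, in (1.3.5)

\[
P\{B_1\} + \sum_{n=2}^{\infty} P\{B_n - B_{n-1}\}
= \lim_{N \to \infty} \left( P\{B_1\} + \sum_{n=2}^{N} \left( P\{B_n\} - P\{B_{n-1}\} \right) \right)
= \lim_{N \to \infty} P\{B_N\}. \tag{1.3.7}
\]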
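Thus, Equation (1.3.3) is proven.

(ii) Since $B_n \supset B_{n+1}$, $n \ge 1$, we have $\bar{B}_n \subset \bar{B}_{n+1}$, $n \ge 1$, and
$\lim_{n \to \infty} B_n = \bigcap_{n=1}^{\infty} B_n$. Hence, applying part (i) to the increasing
sequence $\{\bar{B}_n\}$,

\[
P\left\{\lim_{n \to \infty} B_n\right\} = 1 - P\left\{\bigcup_{n=1}^{\infty} \bar{B}_n\right\}
= 1 - \lim_{n \to \infty} P\{\bar{B}_n\} = \lim_{n \to \infty} P\{B_n\}.
\]

QED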


Sets in a probability space are called events. 


1.4 CONDITIONAL PROBABILITIES AND INDEPENDENCE 


The conditional probability of an event $A \in \mathcal{F}$ given an event $B \in \mathcal{F}$ such that
$P\{B\} > 0$, is defined as

\[
P\{A \mid B\} = \frac{P\{A \cap B\}}{P\{B\}}. \tag{1.4.1}
\]

We see first that $P\{\cdot \mid B\}$ is a probability function on $\mathcal{F}$. Indeed, for every $A \in \mathcal{F}$,
$0 \le P\{A \mid B\} \le 1$. Moreover, $P\{S \mid B\} = 1$ and if $A_1$ and $A_2$ are disjoint events in
$\mathcal{F}$, then

\[
P\{A_1 \cup A_2 \mid B\} = \frac{P\{(A_1 \cup A_2) B\}}{P\{B\}}
= \frac{P\{A_1 B\} + P\{A_2 B\}}{P\{B\}}
= P\{A_1 \mid B\} + P\{A_2 \mid B\}. \tag{1.4.2}
\]

If $P\{B\} > 0$ and $P\{A\} \ne P\{A \mid B\}$, we say that the events $A$ and $B$ are
dependent. On the other hand, if $P\{A\} = P\{A \mid B\}$ we say that $A$ and $B$ are independent
events. Notice that two events are independent if and only if

\[
P\{AB\} = P\{A\} P\{B\}. \tag{1.4.3}
\]

Given $n$ events in $\mathcal{A}$, namely $A_1, \ldots, A_n$, we say that they are pairwise independent
if $P\{A_i A_j\} = P\{A_i\} P\{A_j\}$ for any $i \ne j$. The events are said to be independent in
triplets if

\[
P\{A_i A_j A_k\} = P\{A_i\} P\{A_j\} P\{A_k\}
\]

for any $i \ne j \ne k$. Example 1.4 shows that pairwise independence does not imply
independence in triplets.
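
A small enumeration makes the distinction concrete. The sketch below uses a standard two-coin illustration, which is an assumption of this example and not necessarily the construction used in Example 1.4: $A$ is "first toss is heads," $B$ is "second toss is heads," and $C$ is "the two tosses agree."

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair coin tosses; all four outcomes equally likely.
S = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event under the uniform measure on S."""
    return Fraction(len(event), len(S))

A = {w for w in S if w[0] == "H"}     # first toss is heads
B = {w for w in S if w[1] == "H"}     # second toss is heads
C = {w for w in S if w[0] == w[1]}    # both tosses show the same face

pairwise = all(
    prob(E1 & E2) == prob(E1) * prob(E2)
    for E1, E2 in [(A, B), (A, C), (B, C)]
)
triple_lhs = prob(A & B & C)
triple_rhs = prob(A) * prob(B) * prob(C)
print(pairwise, triple_lhs, triple_rhs)
```

Each pair satisfies (1.4.3), but $P\{ABC\} = 1/4$ while $P\{A\}P\{B\}P\{C\} = 1/8$, so the three events are not independent in triplets.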

Given $n$ events $A_1, \ldots, A_n$ of $\mathcal{F}$, we say that they are independent if, for any
$2 \le k \le n$ and any $k$-tuple $(1 \le i_1 < i_2 < \cdots < i_k \le n)$,

\[
P\left\{\bigcap_{j=1}^{k} A_{i_j}\right\} = \prod_{j=1}^{k} P\{A_{i_j}\}. \tag{1.4.4}
\]

Events in an infinite sequence $\{A_1, A_2, \ldots\}$ are said to be independent if
$\{A_1, \ldots, A_n\}$ are independent, for each $n \ge 2$. Given a sequence of events $A_1, A_2, \ldots$
of a σ-field $\mathcal{F}$, we have seen that

\[
\limsup_{n \to \infty} A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k.
\]

This event means that points $w$ in $\limsup_{n \to \infty} A_n$ belong to infinitely many of the events
$\{A_n\}$. Thus, the event $\limsup_{n \to \infty} A_n$ is denoted also as $\{A_n, \text{i.o.}\}$, where i.o.
stands for “infinitely often.”


The following important theorem, known as the Borel-Cantelli Lemma, gives
conditions under which $P\{A_n, \text{i.o.}\}$ is either 0 or 1.


Theorem 1.4.1 (Borel-Cantelli). Let $\{A_n\}$ be a sequence of sets in $\mathcal{F}$.

(i) If $\sum_{n=1}^{\infty} P\{A_n\} < \infty$, then $P\{A_n, \text{i.o.}\} = 0$.

(ii) If $\sum_{n=1}^{\infty} P\{A_n\} = \infty$ and $\{A_n\}$ are independent, then $P\{A_n, \text{i.o.}\} = 1$.


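Proof. (i) Notice that $B_n = \bigcup_{k=n}^{\infty} A_k$ is a decreasing sequence. Thus

\[
P\{A_n, \text{i.o.}\} = P\left\{\bigcap_{n=1}^{\infty} B_n\right\} = \lim_{n \to \infty} P\{B_n\}.
\]

But

\[
P\{B_n\} = P\left\{\bigcup_{k=n}^{\infty} A_k\right\} \le \sum_{k=n}^{\infty} P\{A_k\}.
\]

The assumption that $\sum_{n=1}^{\infty} P\{A_n\} < \infty$ implies that
$\lim_{n \to \infty} \sum_{k=n}^{\infty} P\{A_k\} = 0$.
(ii) Since $A_1, A_2, \ldots$ are independent, $\bar{A}_1, \bar{A}_2, \ldots$ are independent. This implies
that, for every $n \ge 1$,

\[
P\left\{\bigcap_{k=n}^{\infty} \bar{A}_k\right\} = \prod_{k=n}^{\infty} P\{\bar{A}_k\}
= \prod_{k=n}^{\infty} (1 - P\{A_k\}).
\]

If $0 \le x \le 1$ then $\log(1 - x) \le -x$ (with the convention $\log 0 = -\infty$). Thus,

\[
\log \prod_{k=n}^{\infty} (1 - P\{A_k\}) = \sum_{k=n}^{\infty} \log(1 - P\{A_k\})
\le -\sum_{k=n}^{\infty} P\{A_k\} = -\infty,
\]

since $\sum_{n=1}^{\infty} P\{A_n\} = \infty$. Thus $P\left\{\bigcap_{k=n}^{\infty} \bar{A}_k\right\} = 0$ for all
$n \ge 1$, so the complement of $\{A_n, \text{i.o.}\}$, which is $\bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} \bar{A}_k$,
has probability zero. This implies that

$P\{A_n, \text{i.o.}\} = 1$. QED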


We conclude this section with the celebrated Bayes Theorem. 

Let $\mathcal{D} = \{B_i, i \in J\}$ be a partition of $S$, where $J$ is an index set having a finite or
countable number of elements. Let $B_j \in \mathcal{F}$ and $P\{B_j\} > 0$ for all $j \in J$. Let $A \in \mathcal{F}$,
$P\{A\} > 0$. We are interested in the conditional probabilities $P\{B_j \mid A\}$, $j \in J$.


Theorem 1.4.2 (Bayes).

\[
P\{B_j \mid A\} = \frac{P\{B_j\} P\{A \mid B_j\}}{\sum_{j' \in J} P\{B_{j'}\} P\{A \mid B_{j'}\}}. \tag{1.4.5}
\]

Proof. Left as an exercise. QED


Bayes Theorem is widely used in scientific inference. Examples of the application 
of Bayes Theorem are given in many elementary books. Advanced examples of 
Bayesian inference will be given in later chapters. 
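
As a quick numerical illustration of (1.4.5), the following sketch computes posterior probabilities over a finite partition with exact rational arithmetic. The priors and likelihoods are hypothetical numbers chosen only for the example, and `posterior` is an illustrative helper name.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P{B_j | A} for each j, given P{B_j} and P{A | B_j}, as in (1.4.5)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                 # P{A}, by the law of total probability
    if total == 0:
        raise ValueError("P{A} must be positive")
    return [j / total for j in joint]

# Hypothetical partition into three events with prior probabilities P{B_j}.
priors = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
# Hypothetical conditional probabilities P{A | B_j}.
likelihoods = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2)]

post = posterior(priors, likelihoods)
print(post)
```

Although $B_3$ has the smallest prior probability here, it has the largest posterior probability, because $A$ is much more likely under $B_3$ than under the other two events.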


1.5 RANDOM VARIABLES AND THEIR DISTRIBUTIONS 


Random variables are finite real-valued functions on the sample space $S$, such that
measurable subsets of $\mathcal{F}$ are mapped into Borel sets on the real line and thus can be
assigned probability measures. The situation is simple if $S$ contains only a finite or
countably infinite number of points.

In the general case, $S$ might contain uncountably many points. Even if
$S$ is the space of all infinite binary sequences $w = (i_1, i_2, \ldots)$, the number of points
in $S$ is uncountable. To make our theory rich enough, we will require that the
probability space be $(S, \mathcal{F}, P)$, where $\mathcal{F}$ is a σ-field. A random variable $X$ is a
finite real-valued function on $S$. We wish to define the distribution function of $X$, on
$\mathbb{R}$, as

\[
F_X(x) = P\{w : X(w) \le x\}. \tag{1.5.1}
\]


For this purpose, we must require that every Borel set on $\mathbb{R}$ has a measurable inverse
image with respect to $\mathcal{F}$. More specifically, given $(S, \mathcal{F}, P)$, let $(\mathbb{R}, \mathcal{B})$ be the Borel
measurable space where $\mathbb{R}$ is the real line and $\mathcal{B}$ the Borel σ-field of subsets of $\mathbb{R}$. A
subset $B$ of $(\mathbb{R}, \mathcal{B})$ is called a Borel set if $B$ belongs to $\mathcal{B}$. Let $X : S \to \mathbb{R}$. The inverse
image of a Borel set $B$ with respect to $X$ is

\[
X^{-1}(B) = \{w : X(w) \in B\}. \tag{1.5.2}
\]

A function $X : S \to \mathbb{R}$ is called $\mathcal{F}$-measurable if $X^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{B}$. Thus,
a random variable with respect to $(S, \mathcal{F}, P)$ is an $\mathcal{F}$-measurable function on $S$.
The class $\mathcal{F}_X = \{X^{-1}(B) : B \in \mathcal{B}\}$ is also a σ-field, generated by the random variable
$X$. Notice that $\mathcal{F}_X \subset \mathcal{F}$.

By definition, every random variable $X$ has a distribution function $F_X$. The
probability measure $P_X\{\cdot\}$ induced by $X$ on $(\mathbb{R}, \mathcal{B})$ is

\[
P_X\{B\} = P\{X^{-1}(B)\}, \quad B \in \mathcal{B}. \tag{1.5.3}
\]

A distribution function $F_X$ is a real-valued function satisfying the properties

(i) $\lim_{x \to -\infty} F_X(x) = 0$;
(ii) $\lim_{x \to \infty} F_X(x) = 1$;
(iii) if $x_1 < x_2$ then $F_X(x_1) \le F_X(x_2)$; and
(iv) $\lim_{\epsilon \downarrow 0} F_X(x + \epsilon) = F_X(x)$, and $\lim_{\epsilon \downarrow 0} F_X(x - \epsilon) = F_X(x-)$, all $-\infty < x < \infty$.

Thus, a distribution function $F$ is right-continuous.
Given a distribution function $F_X$, we obtain from (1.5.1), for every
$-\infty < a < b < \infty$,

\[
P\{w : a < X(w) \le b\} = F_X(b) - F_X(a) \tag{1.5.4}
\]

and

\[
P\{w : X(w) = x_0\} = F_X(x_0) - F_X(x_0-). \tag{1.5.5}
\]

Thus, if $F_X$ is continuous at a point $x_0$, then $P\{w : X(w) = x_0\} = 0$. If $X$ is a random
variable, then $Y = g(X)$ is a random variable only if $g$ is $\mathcal{B}$-(Borel) measurable,
i.e., for any $B \in \mathcal{B}$, $g^{-1}(B) \in \mathcal{B}$. Thus, if $Y = g(X)$, $g$ is $\mathcal{B}$-measurable and $X$ is
$\mathcal{F}$-measurable, then $Y$ is also $\mathcal{F}$-measurable. The distribution function of $Y$ is

\[
F_Y(y) = P\{w : g(X(w)) \le y\}. \tag{1.5.6}
\]

Any two random variables X, Y having the same distribution are equivalent. We 
denote this by Y ~ X. 

A distribution function $F$ may have a countable number of distinct points of
discontinuity. If $x_0$ is a point of discontinuity, $F(x_0) - F(x_0-) > 0$. In between
points of discontinuity, $F$ is continuous. If $F$ assumes a constant value between
points of discontinuity (step function), it is called discrete. Formally, let
$-\infty < x_1 < x_2 < \cdots < \infty$ be points of discontinuity of $F$. Let $I_A(x)$ denote the indicator
function of a set $A$, i.e.,

\[
I_A(x) = \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{otherwise.} \end{cases}
\]

Then a discrete $F$ can be written as

\[
F_d(x) = \sum_{i=1}^{\infty} I_{[x_i, x_{i+1})}(x) F(x_i)
= \sum_{\{x_i \le x\}} \left( F(x_i) - F(x_i-) \right). \tag{1.5.7}
\]
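
The second form of (1.5.7) translates directly into code: the distribution function at $x$ is the sum of the jumps at points not exceeding $x$. The sketch below assumes a hypothetical discrete distribution with jumps $1/4, 1/2, 1/4$ at the points $0, 1, 2$; the helper names are illustrative.

```python
from fractions import Fraction

# Hypothetical jump points x_i and jump sizes F(x_i) - F(x_i-).
points = [0, 1, 2]
jumps = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

def F(x):
    """Right-continuous step function: sum of jumps at points x_i <= x."""
    return sum((p for xi, p in zip(points, jumps) if xi <= x), Fraction(0))

def F_left(x):
    """Left limit F(x-): sum of jumps at points x_i < x."""
    return sum((p for xi, p in zip(points, jumps) if xi < x), Fraction(0))

def point_mass(x):
    """P{X = x} = F(x) - F(x-), as in (1.5.5)."""
    return F(x) - F_left(x)

print(F(0.5), F(1), F_left(1), point_mass(1), point_mass(1.5))
```

At the jump point $x = 1$ the function value differs from the left limit by exactly the probability of that point, while at a continuity point such as $x = 1.5$ the point mass is zero.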


Let $\mu_1$ and $\mu_2$ be measures on $(\mathbb{R}, \mathcal{B})$. We say that $\mu_1$ is absolutely continuous
with respect to $\mu_2$, and write $\mu_1 \ll \mu_2$, if $B \in \mathcal{B}$ and $\mu_2(B) = 0$ imply $\mu_1(B) = 0$. Let $\lambda$
denote the Lebesgue measure on $(\mathbb{R}, \mathcal{B})$. For every interval $(a, b]$, $-\infty < a < b < \infty$,
$\lambda((a, b]) = b - a$. The celebrated Radon-Nikodym Theorem (see Shiryayev, 1984,
p. 194) states that if $\mu_1 \ll \mu_2$ and $\mu_1$, $\mu_2$ are σ-finite measures on $(\mathbb{R}, \mathcal{B})$, there exists
a $\mathcal{B}$-measurable nonnegative function $f(x)$ so that, for each $B \in \mathcal{B}$,

\[
\mu_1(B) = \int_B f(x) \, d\mu_2(x) \tag{1.5.8}
\]
where the Lebesgue integral in (1.5.8) will be discussed later. In particular, if $P_X$
is absolutely continuous with respect to the Lebesgue measure $\lambda$, then there exists a
function $f \ge 0$ so that

\[
P_X\{B\} = \int_B f(x) \, dx, \quad B \in \mathcal{B}. \tag{1.5.9}
\]

Moreover,

\[
F_X(x) = \int_{-\infty}^{x} f(y) \, dy, \quad -\infty < x < \infty. \tag{1.5.10}
\]

A distribution function $F$ is called absolutely continuous if there exists a
nonnegative function $f$ such that

\[
F(\xi) = \int_{-\infty}^{\xi} f(x) \, dx, \quad -\infty < \xi < \infty. \tag{1.5.11}
\]


The function $f$, which can be represented for “almost all $x$” by the derivative of $F$,
is called the probability density function (p.d.f.) corresponding to $F$.

If $F$ is absolutely continuous, then $f(x) = \frac{d}{dx} F(x)$ “almost everywhere.” The
term “almost everywhere” or “almost all” $x$ means for all $x$ values, excluding maybe
a set $N$ of Lebesgue measure zero. Moreover, the probability assigned to any
interval $(\alpha, \beta]$, $\alpha \le \beta$, is

\[
P\{\alpha < X \le \beta\} = F(\beta) - F(\alpha) = \int_{\alpha}^{\beta} f(x) \, dx. \tag{1.5.12}
\]

Due to the continuity of $F$ we can also write

\[
P\{\alpha < X \le \beta\} = P\{\alpha \le X \le \beta\}.
\]


Often the density functions f are Riemann integrable, and the above integrals are 
Riemann integrals. Otherwise, these are all Lebesgue integrals, which are defined in 
the next section. 

There are continuous distribution functions that are not absolutely continuous. 
Such distributions are called singular. An example of a singular distribution is the 
Cantor distribution (see Shiryayev, 1984, p. 155). 

Finally, every distribution function $F(x)$ is a mixture of the three types of
distributions: discrete distributions $F_d(\cdot)$, absolutely continuous distributions $F_{ac}(\cdot)$,
and singular distributions $F_s(\cdot)$. That is, for some $0 \le p_1, p_2, p_3 \le 1$ such that
$p_1 + p_2 + p_3 = 1$,

\[
F(x) = p_1 F_d(x) + p_2 F_{ac}(x) + p_3 F_s(x).
\]

In this book we treat only mixtures of $F_d(x)$ and $F_{ac}(x)$.
1.6 THE LEBESGUE AND STIELTJES INTEGRALS


1.6.1 General Definition of Expected Value: The Lebesgue Integral 


Let $(S, \mathcal{F}, P)$ be a probability space. If $X$ is a random variable, we wish to define the
integral

\[
E\{X\} = \int_S X(w) P\{dw\}. \tag{1.6.1}
\]

We define first $E\{X\}$ for nonnegative random variables, i.e., $X(w) \ge 0$ for all
$w \in S$. Generally, $X = X^+ - X^-$, where $X^+(w) = \max(0, X(w))$ and
$X^-(w) = -\min(0, X(w))$.

Given a nonnegative random variable $X$ we construct for a given finite integer $n$
the events

\[
A_{k,n} = \left\{ w : \frac{k-1}{2^n} \le X(w) < \frac{k}{2^n} \right\}, \quad k = 1, 2, \ldots, n2^n
\]

and

\[
A_{n2^n+1,n} = \{w : X(w) \ge n\}.
\]

These events form a partition of $S$. Let $X_n$, $n \ge 1$, be the discrete random variable
defined as

\[
X_n(w) = \sum_{k=1}^{n2^n} \frac{k-1}{2^n} I_{A_{k,n}}(w) + n I_{A_{n2^n+1,n}}(w). \tag{1.6.2}
\]

Notice that for each $w$, $X_n(w) \le X_{n+1}(w) \le \cdots \le X(w)$ for all $n$. Also, if $w \in A_{k,n}$,
$k = 1, \ldots, n2^n$, then $|X(w) - X_n(w)| < \frac{1}{2^n}$. Moreover,
$A_{n2^n+1,n} \supset A_{(n+1)2^{n+1}+1,n+1}$, all $n \ge 1$. Thus

\[
\lim_{n \to \infty} A_{n2^n+1,n} = \bigcap_{n=1}^{\infty} \{w : X(w) \ge n\} = \emptyset.
\]

Thus for all $w \in S$

\[
\lim_{n \to \infty} X_n(w) = X(w). \tag{1.6.3}
\]
Now, for each discrete random variable $X_n(w)$

\[
E\{X_n\} = \sum_{k=1}^{n2^n} \frac{k-1}{2^n} P\{A_{k,n}\} + n P\{w : X(w) \ge n\}. \tag{1.6.4}
\]

Obviously $E\{X_n\} \le n$, and $E\{X_{n+1}\} \ge E\{X_n\}$. Thus, $\lim_{n \to \infty} E\{X_n\}$ exists (it might be
$+\infty$). Accordingly, the Lebesgue integral is defined as

\[
E\{X\} = \int X(w) P\{dw\} = \lim_{n \to \infty} E\{X_n\}. \tag{1.6.5}
\]
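
The construction (1.6.2)-(1.6.5) can be followed numerically. The sketch below assumes, purely for illustration, that $X$ is uniformly distributed on $[0, 1]$, so that $P\{A_{k,n}\}$ is a difference of values of a continuous distribution function; `uniform_cdf` and `expected_simple` are hypothetical helper names. Exact rational arithmetic is used so the monotone approach to $E\{X\} = 1/2$ is visible without rounding.

```python
from fractions import Fraction

def uniform_cdf(x):
    """Distribution function of the uniform distribution on [0, 1] (assumed example)."""
    return min(max(Fraction(x), Fraction(0)), Fraction(1))

def expected_simple(n, cdf):
    """E{X_n} from (1.6.4) for a nonnegative X with a continuous cdf."""
    step = Fraction(1, 2 ** n)
    total = Fraction(0)
    for k in range(1, n * 2 ** n + 1):
        lo, hi = (k - 1) * step, k * step
        total += lo * (cdf(hi) - cdf(lo))   # value (k-1)/2^n times P{A_{k,n}}
    total += n * (1 - cdf(n))               # contribution of {X >= n}
    return total

values = [expected_simple(n, uniform_cdf) for n in range(1, 7)]
print(values)
```

For this example $E\{X_n\} = (2^n - 1)/2^{n+1}$, which increases to $1/2$, in agreement with the claim that the sequence $E\{X_n\}$ is nondecreasing.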


The Lebesgue integral may exist when the Riemann integral does not. For example,
consider the probability space $(\mathcal{I}, \mathcal{B}, P)$ where $\mathcal{I} = \{x : 0 \le x \le 1\}$, $\mathcal{B}$ the Borel
σ-field on $\mathcal{I}$, and $P$ the Lebesgue measure on $\mathcal{B}$. Define

\[
f(x) = \begin{cases} 0, & \text{if } x \text{ is irrational on } [0, 1] \\ 1, & \text{if } x \text{ is rational on } [0, 1]. \end{cases}
\]

Let $B_0 = \{x : 0 \le x \le 1, f(x) = 0\}$, $B_1 = [0, 1] - B_0$. The Lebesgue integral
of $f$ is

\[
\int_0^1 f(x) \, dx = 0 \cdot P\{B_0\} + 1 \cdot P\{B_1\} = 0,
\]

since the Lebesgue measure of $B_1$, the set of rationals in $[0, 1]$, is zero. On the other hand, the
Riemann integral of $f(x)$ does not exist. Notice that, contrary to the construction of the Riemann integral,


the Lebesgue integral $\int f(x) P\{dx\}$ of a nonnegative function $f$ is obtained by
partitioning the range of the function $f$ to $2^n$ subintervals $\mathcal{D}_n = \{B_j^{(n)}\}$ and
constructing a discrete random variable
$\hat{f}_n = \sum_{j=1}^{2^n} f^*_{n,j} I\{x \in B_j^{(n)}\}$, where
$f^*_{n,j} = \inf\{f(x) : x \in B_j^{(n)}\}$. The expected value of $\hat{f}_n$ is
$E\{\hat{f}_n\} = \sum_{j=1}^{2^n} f^*_{n,j} P(X \in B_j^{(n)})$. The sequence
$\{E\{\hat{f}_n\}, n \ge 1\}$ is nondecreasing, and its limit exists (might be $+\infty$). Generally, we
define

\[
E\{X\} = E\{X^+\} - E\{X^-\} \tag{1.6.6}
\]

if either $E\{X^+\} < \infty$ or $E\{X^-\} < \infty$.
If $E\{X^+\} = \infty$ and $E\{X^-\} = \infty$, we say that $E\{X\}$ does not exist. As a special
case, if $F$ is absolutely continuous with density $f$, then

\[
E\{X\} = \int_{-\infty}^{\infty} x f(x) \, dx
\]

provided $\int_{-\infty}^{\infty} |x| f(x) \, dx < \infty$. If $F$ is discrete then

\[
E\{X\} = \sum_{n=1}^{\infty} x_n P\{X = x_n\}
\]

provided it is absolutely convergent.

From the definition (1.6.4), it is obvious that if $P\{X(w) \ge 0\} = 1$ then $E\{X\} \ge 0$.
This immediately implies that if $X$ and $Y$ are two random variables such that
$P\{w : X(w) \ge Y(w)\} = 1$, then $E\{X - Y\} \ge 0$. Also, if $E\{X\}$ exists then, for all $A \in \mathcal{F}$,

\[
E\{|X| I_A(X)\} \le E\{|X|\},
\]

and $E\{X I_A(X)\}$ exists. If $E\{X\}$ is finite, $E\{X I_A(X)\}$ is also finite. From the definition
of expectation we immediately obtain that for any finite constant $c$,

\[
E\{cX\} = cE\{X\}, \qquad E\{X + Y\} = E\{X\} + E\{Y\}. \tag{1.6.7}
\]

Equation (1.6.7) implies that the expected value is a linear functional, i.e., if
$X_1, \ldots, X_n$ are random variables on $(S, \mathcal{F}, P)$ and $\beta_0, \beta_1, \ldots, \beta_n$ are finite
constants, then, if all expectations exist,

\[
E\left\{\beta_0 + \sum_{i=1}^{n} \beta_i X_i\right\} = \beta_0 + \sum_{i=1}^{n} \beta_i E\{X_i\}. \tag{1.6.8}
\]


We present now a few basic theorems on the convergence of the expectations of 
sequences of random variables. 


Theorem 1.6.1 (Monotone Convergence). Let $\{X_n\}$ be a monotone sequence of
random variables and $Y$ a random variable.

(i) Suppose that $X_n(w) \nearrow X(w)$ as $n \to \infty$, $X_n(w) \ge Y(w)$ for all $n$ and all $w \in S$, and
$E\{Y\} > -\infty$. Then

\[
\lim_{n \to \infty} E\{X_n\} = E\{X\}.
\]

(ii) If $X_n(w) \searrow X(w)$, $X_n(w) \le Y(w)$, for all $n$ and all $w \in S$, and $E\{Y\} < \infty$,
then

\[
\lim_{n \to \infty} E\{X_n\} = E\{X\}.
\]

Proof. See Shiryayev (1984, p. 184). QED


Corollary 1.6.1. If $X_1, X_2, \ldots$ are nonnegative random variables, then

\[
E\left\{\sum_{n=1}^{\infty} X_n\right\} = \sum_{n=1}^{\infty} E\{X_n\}. \tag{1.6.9}
\]


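Theorem 1.6.2 (Fatou). Let $X_n$, $n \ge 1$ and $Y$ be random variables.

(i) If $X_n(w) \ge Y(w)$, $n \ge 1$, for each $w$ and $E\{Y\} > -\infty$, then

\[
E\left\{\liminf_{n \to \infty} X_n\right\} \le \liminf_{n \to \infty} E\{X_n\};
\]

(ii) if $X_n(w) \le Y(w)$, $n \ge 1$, for each $w$ and $E\{Y\} < \infty$, then

\[
\limsup_{n \to \infty} E\{X_n\} \le E\left\{\limsup_{n \to \infty} X_n\right\};
\]

(iii) if $|X_n(w)| \le Y(w)$ for each $w$, and $E\{Y\} < \infty$, then

\[
E\left\{\liminf_{n \to \infty} X_n\right\} \le \liminf_{n \to \infty} E\{X_n\}
\le \limsup_{n \to \infty} E\{X_n\} \le E\left\{\limsup_{n \to \infty} X_n\right\}. \tag{1.6.10}
\]

Proof. (i)

\[
\liminf_{n \to \infty} X_n(w) = \lim_{n \to \infty} \inf_{m \ge n} X_m(w).
\]

The sequence $Z_n(w) = \inf_{m \ge n} X_m(w)$, $n \ge 1$ is monotonically increasing for each $w$,
and $Z_n(w) \ge Y(w)$, $n \ge 1$. Hence, by Theorem 1.6.1,

\[
\lim_{n \to \infty} E\{Z_n\} = E\left\{\lim_{n \to \infty} Z_n\right\}.
\]

Or

\[
E\left\{\liminf_{n \to \infty} X_n\right\} = \lim_{n \to \infty} E\{Z_n\}
= \liminf_{n \to \infty} E\{Z_n\} \le \liminf_{n \to \infty} E\{X_n\}.
\]

The proof of (ii) is obtained by defining $Z_n(w) = \sup_{m \ge n} X_m(w)$, and applying the
previous theorem. Part (iii) is a result of (i) and (ii). QED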


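Theorem 1.6.3 (Lebesgue Dominated Convergence). Let $Y$, $X$, $X_n$, $n \ge 1$, be
random variables such that $|X_n(w)| \le Y(w)$, $n \ge 1$ for almost all $w$, and $E\{Y\} < \infty$.
Assume also that $P\left\{w : \lim_{n \to \infty} X_n(w) = X(w)\right\} = 1$. Then $E\{|X|\} < \infty$ and

\[
\lim_{n \to \infty} E\{X_n\} = E\left\{\lim_{n \to \infty} X_n\right\}, \tag{1.6.11}
\]

and

\[
\lim_{n \to \infty} E\{|X_n - X|\} = 0. \tag{1.6.12}
\]

Proof. By Fatou’s Theorem (Theorem 1.6.2)

\[
E\left\{\liminf_{n \to \infty} X_n\right\} \le \liminf_{n \to \infty} E\{X_n\}
\le \limsup_{n \to \infty} E\{X_n\} \le E\left\{\limsup_{n \to \infty} X_n\right\}.
\]

But since $\lim_{n \to \infty} X_n(w) = X(w)$, with probability 1,

\[
E\{X\} = E\left\{\lim_{n \to \infty} X_n\right\} = \lim_{n \to \infty} E\{X_n\}.
\]

Moreover, $|X(w)| \le Y(w)$ for almost all $w$ (with probability 1). Hence, $E\{|X|\} < \infty$.
Finally, since $|X_n(w) - X(w)| \le 2Y(w)$, with probability 1

\[
\lim_{n \to \infty} E\{|X_n - X|\} = E\left\{\lim_{n \to \infty} |X_n - X|\right\} = 0.
\]

QED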


We conclude this section with a theorem on change of variables under Lebesgue 
integrals. 


Theorem 1.6.4. Let $X$ be a random variable with respect to $(S, \mathcal{F}, P)$. Let
$g : \mathbb{R} \to \mathbb{R}$ be a Borel measurable function. Then for each $B \in \mathcal{B}$,

\[
\int_B g(x) P_X\{dx\} = \int_{X^{-1}(B)} g(X(w)) P\{dw\}. \tag{1.6.13}
\]

The proof of the theorem is based on the following steps.

1. If $A \in \mathcal{B}$ and $g(x) = I_A(x)$ then

\[
\int_B g(x) P_X\{dx\} = \int_B I_A(x) P_X\{dx\} = P_X\{A \cap B\}
= P\{X^{-1}(A) \cap X^{-1}(B)\}
= \int_{X^{-1}(B)} g(X(w)) P\{dw\}.
\]

2. Show that Equation (1.6.13) holds for simple random variables.
3. Follow the steps of the definition of the Lebesgue integral.


1.6.2 The Stieltjes-Riemann Integral 


Let $g$ be a function of a real variable and $F$ a distribution function. Let $(\alpha, \beta]$ be a
half-closed interval. Let

\[
\alpha = x_0 < x_1 < \cdots < x_{n-1} < x_n = \beta
\]

be a partition of $(\alpha, \beta]$ to $n$ subintervals $(x_{i-1}, x_i]$, $i = 1, \ldots, n$. In each subinterval
choose $x_i'$, $x_{i-1} < x_i' \le x_i$ and consider the sum

\[
S_n = \sum_{i=1}^{n} g(x_i') [F(x_i) - F(x_{i-1})]. \tag{1.6.14}
\]

If, as $n \to \infty$, $\max_{1 \le i \le n} |x_i - x_{i-1}| \to 0$ and if $\lim_{n \to \infty} S_n$ exists (finite)
independently of the partitions, then the limit is called the Stieltjes-Riemann integral of $g$ with respect
to $F$. We denote this integral as

\[
\int_{\alpha}^{\beta} g(x) \, dF(x).
\]


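This integral has the usual linear properties, i.e.,

(i) $\int_{\alpha}^{\beta} c g(x) \, dF(x) = c \int_{\alpha}^{\beta} g(x) \, dF(x)$;

(ii) $\int_{\alpha}^{\beta} (g_1(x) + g_2(x)) \, dF(x) = \int_{\alpha}^{\beta} g_1(x) \, dF(x) + \int_{\alpha}^{\beta} g_2(x) \, dF(x)$;      (1.6.15)

and

(iii) $\int_{\alpha}^{\beta} g(x) \, d(\gamma F_1(x) + \delta F_2(x)) = \gamma \int_{\alpha}^{\beta} g(x) \, dF_1(x) + \delta \int_{\alpha}^{\beta} g(x) \, dF_2(x)$.

One can integrate by parts, if all expressions exist, according to the formula

\[
\int_{\alpha}^{\beta} g(x) \, dF(x) = [g(\beta) F(\beta) - g(\alpha) F(\alpha)] - \int_{\alpha}^{\beta} g'(x) F(x) \, dx, \tag{1.6.16}
\]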

where $g'(x)$ is the derivative of $g(x)$. If $F$ is strictly discrete, with jump points
$-\infty < \xi_1 < \xi_2 < \cdots < \infty$,

\[
\int_{\alpha}^{\beta} g(x) \, dF(x) = \sum_{j=1}^{\infty} I\{\alpha < \xi_j \le \beta\} g(\xi_j) p_j, \tag{1.6.17}
\]

where $p_j = F(\xi_j) - F(\xi_j-)$, $j = 1, 2, \ldots$. If $F$ is absolutely continuous, then at
almost all points,

\[
F(x + dx) - F(x) = f(x) \, dx + o(dx),
\]

as $dx \to 0$. Thus, in the absolutely continuous case

\[
\int_{\alpha}^{\beta} g(x) \, dF(x) = \int_{\alpha}^{\beta} g(x) f(x) \, dx. \tag{1.6.18}
\]

Finally, the improper Stieltjes-Riemann integral, if it exists, is

\[
\int_{-\infty}^{\infty} g(x) \, dF(x) = \lim_{\substack{\alpha \to -\infty \\ \beta \to \infty}} \int_{\alpha}^{\beta} g(x) \, dF(x). \tag{1.6.19}
\]

If $B$ is a set obtained by union and complementation of a sequence of intervals, we
can write, by setting $g(x) = I\{x \in B\}$,

\[
P\{B\} = \int_{-\infty}^{\infty} I\{x \in B\} \, dF(x) = \int_B dF(x), \tag{1.6.20}
\]

where $F$ is either discrete or absolutely continuous.
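
The sums (1.6.14) can be evaluated numerically to see both special cases (1.6.17) and (1.6.18) emerge. The sketch below is a minimal illustration under stated assumptions: it uses a uniform partition with the right endpoint as $x_i'$, a hypothetical discrete $F$ with jumps $1/4, 1/2, 1/4$ at $0, 1, 2$, and the exponential distribution function $F(x) = 1 - e^{-x}$ as the absolutely continuous case.

```python
import math

def stieltjes_sum(g, F, a, b, n):
    """S_n of (1.6.14) on a uniform partition of (a, b], taking x_i' = x_i."""
    total = 0.0
    prev = a
    for i in range(1, n + 1):
        x = a + (b - a) * i / n
        total += g(x) * (F(x) - F(prev))
        prev = x
    return total

# Discrete case: hypothetical jumps at 0, 1, 2 with sizes 1/4, 1/2, 1/4.
JUMPS = {0.0: 0.25, 1.0: 0.5, 2.0: 0.25}

def F_discrete(x):
    return sum(p for xi, p in JUMPS.items() if xi <= x)

# Absolutely continuous case: exponential distribution with unit rate.
def F_exponential(x):
    return 1.0 - math.exp(-x) if x > 0 else 0.0

discrete_value = stieltjes_sum(lambda x: x * x, F_discrete, -1, 3, 400)
continuous_value = stieltjes_sum(lambda x: x, F_exponential, 0, 5, 10000)
exact_continuous = 1.0 - 6.0 * math.exp(-5.0)   # integral of x e^{-x} over (0, 5]
print(discrete_value, continuous_value, exact_continuous)
```

In the discrete case only the subintervals that contain a jump point contribute, which reproduces $\sum_j \xi_j^2 p_j = 3/2$ as in (1.6.17); in the continuous case the sum approaches $\int_0^5 x e^{-x} dx$ as in (1.6.18), with an error that shrinks with the mesh of the partition.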
1.6.3 Mixtures of Discrete and Absolutely Continuous Distributions


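Let $F_d$ be a discrete distribution and let $F_{ac}$ be an absolutely continuous distribution
function. Then for all $\alpha$, $0 \le \alpha \le 1$,

\[
F(x) = \alpha F_d(x) + (1 - \alpha) F_{ac}(x) \tag{1.6.21}
\]

is also a distribution function, which is a mixture of the two types. Thus, for such
mixtures, if $-\infty < \xi_1 < \xi_2 < \cdots < \infty$ are the jump points of $F_d$, then for every
$-\infty < \gamma \le \delta < \infty$ and $B = (\gamma, \delta]$,

\[
P\{B\} = \int_{\gamma}^{\delta} dF(x)
= \alpha \sum_{j=1}^{\infty} I\{\gamma < \xi_j \le \delta\} \, dF_d(\xi_j) + (1 - \alpha) \int_{\gamma}^{\delta} dF_{ac}(x). \tag{1.6.22}
\]

Moreover, if $B^+ = [\gamma, \delta]$ then

\[
P\{B^+\} = P\{B\} + \alpha \, dF_d(\gamma),
\]

where $dF_d(\gamma) = F_d(\gamma) - F_d(\gamma-)$ is the jump of $F_d$ at $\gamma$ (zero if $\gamma$ is not a jump point).
The expected value of $X$, when $F(x) = p F_d(x) + (1 - p) F_{ac}(x)$ is,

\[
E\{X\} = p \sum_{\{j\}} \xi_j f_d(\xi_j) + (1 - p) \int_{-\infty}^{\infty} x f_{ac}(x) \, dx, \tag{1.6.23}
\]

where $\{\xi_j\}$ is the set of jump points of $F_d$; $f_d$ and $f_{ac}$ are the corresponding p.d.f.s.
We assume here that the sum and the integral are absolutely convergent.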
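
A zero-inflated lifetime model gives a concrete mixture of the two types. The sketch below is illustrative only: the mixing weight, the single jump point at 0, and the exponential continuous part are assumptions of the example, and the mean of the continuous part is supplied in closed form instead of being integrated numerically.

```python
import math

ALPHA = 0.3            # hypothetical weight of the discrete component
JUMPS = {0.0: 1.0}     # assumed F_d: all discrete mass sits at the point 0

def F_ac(x):
    """Assumed absolutely continuous component: exponential with unit rate."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def prob_half_open(gamma, delta):
    """P{(gamma, delta]} following (1.6.22)."""
    discrete = sum(p for xi, p in JUMPS.items() if gamma < xi <= delta)
    return ALPHA * discrete + (1 - ALPHA) * (F_ac(delta) - F_ac(gamma))

def prob_closed(gamma, delta):
    """P{[gamma, delta]}: add the mass that F places at the left endpoint."""
    return prob_half_open(gamma, delta) + ALPHA * JUMPS.get(gamma, 0.0)

def mixture_mean(ac_mean=1.0):
    """E{X} following (1.6.23); ac_mean is the known mean of the exponential part."""
    return ALPHA * sum(xi * p for xi, p in JUMPS.items()) + (1 - ALPHA) * ac_mean

print(prob_half_open(0, 1), prob_closed(0, 1), mixture_mean())
```

The half-open interval $(0, 1]$ misses the atom at 0 while the closed interval $[0, 1]$ includes it, so the two probabilities differ by exactly the mass that the mixture places at 0.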


1.6.4 Quantiles of Distributions 


The $p$-quantiles or fractiles of distribution functions are inverse points of the
distributions. More specifically, the $p$-quantile of a distribution function $F$, designated by
$x_p$ or $F^{-1}(p)$, is the smallest value of $x$ at which $F(x)$ is greater than or equal to $p$, i.e.,

\[
x_p = F^{-1}(p) = \inf\{x : F(x) \ge p\}. \tag{1.6.24}
\]

The inverse function defined in this fashion is unique. The median of a distribution,
$x_{.5}$, is an important parameter characterizing the location of the distribution. The
lower and upper quartiles are the .25- and .75-quantiles. The difference between
these quantiles, $R_Q = x_{.75} - x_{.25}$, is called the interquartile range. It serves as one
of the measures of dispersion of distribution functions.
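
For a discrete distribution, definition (1.6.24) amounts to accumulating probabilities in increasing order of the jump points until the level $p$ is reached. The following sketch assumes a hypothetical distribution with probabilities $1/4, 1/2, 1/4$ on the points $0, 1, 2$; `quantile` is an illustrative helper name.

```python
from fractions import Fraction

def quantile(points, probs, p):
    """Smallest x with F(x) >= p, for a discrete distribution and 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    cumulative = Fraction(0)
    for x, q in sorted(zip(points, probs)):
        cumulative += q
        if cumulative >= p:
            return x
    raise ValueError("probabilities do not sum to at least p")

# Hypothetical discrete distribution.
points = [0, 1, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

lower_quartile = quantile(points, probs, Fraction(1, 4))
median = quantile(points, probs, Fraction(1, 2))
upper_quartile = quantile(points, probs, Fraction(3, 4))
interquartile_range = upper_quartile - lower_quartile
print(lower_quartile, median, upper_quartile, interquartile_range)
```

Because $F$ is a step function here, several levels $p$ share the same quantile; taking the infimum in (1.6.24) is what makes the value unique.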
1.6.5 Transformations


From the distribution function $F(x) = \alpha F_d(x) + (1 - \alpha) F_{ac}(x)$, $0 \le \alpha \le 1$, we can
derive the distribution function of a transformed random variable $Y = g(X)$, which
is

\[
F_Y(y) = P\{g(X) \le y\} = P\{X \in B_y\}
= \alpha \sum_{j=1}^{\infty} I\{\xi_j \in B_y\} \, dF_d(\xi_j) + (1 - \alpha) \int_{B_y} dF_{ac}(x), \tag{1.6.25}
\]

where

\[
B_y = \{x : g(x) \le y\}.
\]

In particular, if $F$ is absolutely continuous and if $g$ is a strictly increasing differentiable
function, then the p.d.f. of $Y$, $f_Y(y)$, is

\[
f_Y(y) = f_X(g^{-1}(y)) \left( \frac{d}{dy} g^{-1}(y) \right), \tag{1.6.26}
\]

where $g^{-1}(y)$ is the inverse function. If $g'(x) < 0$ for all $x$, then

\[
f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|. \tag{1.6.27}
\]

Suppose that $X$ is a continuous random variable with p.d.f. $f(x)$. Let $g(x)$ be a
differentiable function that is not necessarily one-to-one, like $g(x) = x^2$. Excluding
cases where $g(x)$ is a constant over an interval, like the indicator function, let $m(y)$
denote the number of roots of the equation $g(x) = y$. Let $\xi_j(y)$, $j = 1, \ldots, m(y)$
denote the roots of this equation. Then the p.d.f. of $Y = g(X)$ is

\[
f_Y(y) = \sum_{j=1}^{m(y)} f_X(\xi_j(y)) \cdot \frac{1}{|g'(\xi_j(y))|} \tag{1.6.28}
\]

if $m(y) > 0$ and zero otherwise.
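
Formula (1.6.28) can be checked on a familiar case. The sketch below assumes that $X$ is standard normal and $g(x) = x^2$, for which the two roots are $\pm\sqrt{y}$ and $|g'(\xi)| = 2\sqrt{y}$; the result should agree with the chi-squared density with one degree of freedom, $e^{-y/2}/\sqrt{2\pi y}$. Function names are illustrative.

```python
import math

def standard_normal_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def density_of_square(y, f_X=standard_normal_pdf):
    """p.d.f. of Y = X**2 via the root-sum formula (1.6.28)."""
    if y <= 0:
        # No real roots for y < 0; the single point y = 0 (where g' = 0)
        # has probability zero and is assigned density 0 here.
        return 0.0
    roots = (math.sqrt(y), -math.sqrt(y))          # solutions of x**2 = y
    return sum(f_X(r) / abs(2.0 * r) for r in roots)  # g'(x) = 2x

def chi_squared_1_pdf(y):
    """Known density of the chi-squared distribution with 1 degree of freedom."""
    return math.exp(-y / 2.0) / math.sqrt(2.0 * math.pi * y)

for y in (0.1, 1.0, 1.7, 4.0):
    print(y, density_of_square(y), chi_squared_1_pdf(y))
```

The two columns of densities coincide up to floating-point rounding, which is the content of (1.6.28) for this non-monotone transformation.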
1.7 JOINT DISTRIBUTIONS, CONDITIONAL DISTRIBUTIONS
AND INDEPENDENCE 


1.7.1 Joint Distributions 


Let $(X_1, \ldots, X_k)$ be a vector of $k$ random variables defined on the same probability
space. These random variables represent variables observed in the same experiment.
The joint distribution function of these random variables is a real-valued function $F$
of $k$ real arguments $(\xi_1, \ldots, \xi_k)$ such that

\[
F(\xi_1, \ldots, \xi_k) = P\{X_1 \le \xi_1, \ldots, X_k \le \xi_k\}. \tag{1.7.1}
\]


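The joint distribution of two random variables is called a bivariate distribution
function.
Every bivariate distribution function $F$ has the following properties.

(i) $\lim_{\xi_1 \to -\infty} F(\xi_1, \xi_2) = \lim_{\xi_2 \to -\infty} F(\xi_1, \xi_2) = 0$, for all $\xi_1$, $\xi_2$;

(ii) $\lim_{\xi_1 \to \infty} \lim_{\xi_2 \to \infty} F(\xi_1, \xi_2) = 1$;

(iii) $\lim_{\epsilon \downarrow 0} F(\xi_1 + \epsilon, \xi_2 + \epsilon) = F(\xi_1, \xi_2)$ for all $(\xi_1, \xi_2)$;

(iv) for any $a < b$, $c < d$, $F(b, d) - F(a, d) - F(b, c) + F(a, c) \ge 0$. (1.7.2)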


Property (iii) is the right continuity of $F(\xi_1, \xi_2)$. Property (iv) means that the probability
of every rectangle is nonnegative. Moreover, the total increase of $F(\xi_1, \xi_2)$
is from 0 to 1. Similar properties are required in cases of a larger number of
variables.

Given a bivariate distribution function $F$, the univariate distributions of $X_1$ and
$X_2$ are $F_1$ and $F_2$, where

$$F_1(x) = \lim_{y \to \infty} F(x, y), \quad F_2(y) = \lim_{x \to \infty} F(x, y). \tag{1.7.3}$$

$F_1$ and $F_2$ are called the marginal distributions of $X_1$ and $X_2$, respectively. In
cases of joint distributions of three variables, we can distinguish between three
marginal bivariate distributions and three marginal univariate distributions. As in the
univariate case, multivariate distributions are either discrete, absolutely continuous,
singular, or mixtures of the three main types. In the discrete case there are at most a
countable number of points $\{(\xi_1^{(j)}, \ldots, \xi_k^{(j)}), j = 1, 2, \ldots\}$ on which the distribution
concentrates. In this case the joint probability function is


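$$p(x_1, \ldots, x_k) = \begin{cases} P\{X_1 = \xi_1^{(j)}, \ldots, X_k = \xi_k^{(j)}\}, & \text{if } (x_1, \ldots, x_k) = (\xi_1^{(j)}, \ldots, \xi_k^{(j)}),\ j = 1, 2, \ldots \\ 0, & \text{otherwise.} \end{cases} \tag{1.7.4}$$

Such a discrete p.d.f. can be written as

$$p(x_1, \ldots, x_k) = \sum_{j=1}^{\infty} I\{(x_1, \ldots, x_k) = (\xi_1^{(j)}, \ldots, \xi_k^{(j)})\}\, p_j,$$

where $p_j = P\{X_1 = \xi_1^{(j)}, \ldots, X_k = \xi_k^{(j)}\}$.
In the absolutely continuous case there exists a nonnegative function $f(x_1, \ldots, x_k)$
such that

$$F(\xi_1, \ldots, \xi_k) = \int_{-\infty}^{\xi_1} \cdots \int_{-\infty}^{\xi_k} f(x_1, \ldots, x_k)\, dx_1 \cdots dx_k. \tag{1.7.5}$$

The function $f(x_1, \ldots, x_k)$ is called the joint density function.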


The marginal probability or density functions of single variables or of a subvector
of variables can be obtained by summing (in the discrete case) or integrating (in the
absolutely continuous case) the joint probability or density function with respect to
the variables that are not under consideration, over their range of variation.

Although the presentation here is in terms of k discrete or k absolutely contin- 
uous random variables, the joint distributions can involve some discrete and some 
continuous variables, or mixtures. 

If $X_1$ has an absolutely continuous marginal distribution and $X_2$ is discrete, we
can introduce the function $N(B)$ on $\mathcal{B}$, which counts the number of jump points
of $X_2$ that belong to $B$. $N(B)$ is a $\sigma$-finite measure. Let $\lambda(B)$ be the Lebesgue
measure on $\mathcal{B}$. Consider the $\sigma$-finite measure on $\mathcal{B}^{(2)}$, $\mu(B_1 \times B_2) = \lambda(B_1) N(B_2)$. If
$X_1$ is absolutely continuous and $X_2$ discrete, their joint probability measure $P_X$ is
absolutely continuous with respect to $\mu$. There exists then a nonnegative function $f_X$
such that

$$P_X(B) = \int_B f_X(x_1, x_2)\, d\mu(x_1, x_2), \quad B \in \mathcal{B}^{(2)}.$$

The function $f_X$ is a joint p.d.f. of $X_1$, $X_2$ with respect to $\mu$. The joint p.d.f. $f_X$ is
positive only at jump points of $X_2$.
If $X_1, \ldots, X_k$ have a joint distribution with p.d.f. $f(x_1, \ldots, x_k)$, the expected
value of a function $g(X_1, \ldots, X_k)$ is defined as

$$E\{g(X_1, \ldots, X_k)\} = \int g(x_1, \ldots, x_k)\, dF(x_1, \ldots, x_k). \tag{1.7.6}$$

We have used here the conventional notation for Stieltjes integrals.

Notice that if $(X, Y)$ have a joint distribution function $F(x, y)$ and if $X$ is discrete
with jump points of $F_1(x)$ at $\xi_1, \xi_2, \ldots$, and $Y$ is absolutely continuous, then, as in
the previous example,

$$\int g(x, y)\, dF(x, y) = \sum_{j=1}^{\infty} \int_{-\infty}^{\infty} g(\xi_j, y) f(\xi_j, y)\, dy,$$

where $f(x, y)$ is the joint p.d.f. A similar formula holds for the case of $X$ absolutely
continuous and $Y$ discrete.


1.7.2 Conditional Expectations: General Definition 


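Let $X(\omega) \ge 0$, for all $\omega \in S$, be a random variable with respect to $(S, \mathcal{F}, P)$. Consider
a $\sigma$-field $\mathcal{G}$, $\mathcal{G} \subset \mathcal{F}$. The conditional expectation of $X$ given $\mathcal{G}$ is defined as a
$\mathcal{G}$-measurable random variable $E\{X \mid \mathcal{G}\}$ satisfying

$$\int_A X(\omega) P\{d\omega\} = \int_A E\{X \mid \mathcal{G}\}(\omega) P\{d\omega\} \tag{1.7.7}$$

for all $A \in \mathcal{G}$. Generally, $E\{X \mid \mathcal{G}\}$ is defined if $\min\{E\{X^+ \mid \mathcal{G}\}, E\{X^- \mid \mathcal{G}\}\} < \infty$
and $E\{X \mid \mathcal{G}\} = E\{X^+ \mid \mathcal{G}\} - E\{X^- \mid \mathcal{G}\}$. To see that such conditional expectations
exist, where $X(\omega) \ge 0$ for all $\omega$, consider the $\sigma$-finite measure on $\mathcal{G}$,

$$Q(A) = \int_A X(\omega) P\{d\omega\}, \quad A \in \mathcal{G}. \tag{1.7.8}$$

Obviously $Q \ll P$ and by the Radon-Nikodym Theorem, there exists a nonnegative,
$\mathcal{G}$-measurable random variable $E\{X \mid \mathcal{G}\}$ such that

$$Q(A) = \int_A E\{X \mid \mathcal{G}\}(\omega) P\{d\omega\}. \tag{1.7.9}$$

According to the Radon-Nikodym Theorem, $E\{X \mid \mathcal{G}\}$ is determined only up to a set
of $P$-measure zero.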
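If $B \in \mathcal{F}$ and $X(\omega) = I_B(\omega)$, then $E\{X \mid \mathcal{G}\} = P\{B \mid \mathcal{G}\}$ and according to
(1.6.13),

$$P\{A \cap B\} = \int_A I_B(\omega) P\{d\omega\} = \int_A P\{B \mid \mathcal{G}\}(\omega) P\{d\omega\}. \tag{1.7.10}$$

Notice also that if $X$ is $\mathcal{G}$-measurable then $X = E\{X \mid \mathcal{G}\}$ with probability 1.

On the other hand, if $\mathcal{G} = \{\emptyset, S\}$ is the trivial algebra, then $E\{X \mid \mathcal{G}\} = E\{X\}$
with probability 1.

From the definition (1.7.7), since $S \in \mathcal{G}$,

$$E\{X\} = \int_S X(\omega) P\{d\omega\} = \int_S E\{X \mid \mathcal{G}\}(\omega) P\{d\omega\}.$$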


This is the law of iterated expectation; namely, for all $\mathcal{G} \subset \mathcal{F}$,

$$E\{X\} = E\{E\{X \mid \mathcal{G}\}\}. \tag{1.7.11}$$
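
The law of iterated expectation can be verified exactly on a finite sample space. The sketch below assumes a small hypothetical joint p.m.f. of (X, Y) (the numbers are made up for illustration) and conditions on the sigma-field generated by Y, so that E{X | G} is the function y -> E{X | Y = y}.

```python
from fractions import Fraction as Fr

# Hypothetical joint p.m.f. of (X, Y); keys are (x, y), values are probabilities.
joint = {
    (0, 0): Fr(1, 8), (1, 0): Fr(1, 8), (2, 0): Fr(1, 4),
    (0, 1): Fr(1, 4), (1, 1): Fr(1, 8), (2, 1): Fr(1, 8),
}

def marginal_y(joint):
    """P{Y = y} obtained by summing the joint p.m.f. over x."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0) + p
    return p_y

def cond_mean_x_given_y(joint):
    """E{X | Y = y} for every y with P{Y = y} > 0."""
    p_y = marginal_y(joint)
    num = {}
    for (x, y), p in joint.items():
        num[y] = num.get(y, 0) + x * p
    return {y: num[y] / p_y[y] for y in p_y}

p_y = marginal_y(joint)
e_x_given_y = cond_mean_x_given_y(joint)

e_x = sum(x * p for (x, _), p in joint.items())           # E{X}
e_of_cond = sum(e_x_given_y[y] * p_y[y] for y in p_y)     # E{E{X | Y}}
print(e_x, e_of_cond)
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than up to rounding.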
Furthermore, if $X$ and $Y$ are two random variables on $(S, \mathcal{F}, P)$, the collection of all
sets $\{Y^{-1}(B), B \in \mathcal{B}\}$ is a $\sigma$-field generated by $Y$. Let $\mathcal{F}_Y$ denote this $\sigma$-field. Since
$Y$ is a random variable, $\mathcal{F}_Y \subset \mathcal{F}$. We define

$$E\{X \mid Y\} = E\{X \mid \mathcal{F}_Y\}. \tag{1.7.12}$$

Let $y_0$ be such that $f_Y(y_0) > 0$.


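Consider the $\mathcal{F}_Y$-measurable set $A_\delta = \{\omega : y_0 < Y(\omega) < y_0 + \delta\}$. According to
(1.7.7),

$$\int_{A_\delta} X(\omega) P\{d\omega\} = \int_{-\infty}^{\infty} \int_{y_0}^{y_0+\delta} x f_{XY}(x, y)\, dy\, dx = \int_{y_0}^{y_0+\delta} E\{X \mid Y = y\} f_Y(y)\, dy. \tag{1.7.13}$$

The left-hand side of (1.7.13) is, if $E\{|X|\} < \infty$,

$$\int_{-\infty}^{\infty} x \int_{y_0}^{y_0+\delta} f_{XY}(x, y)\, dy\, dx = f_Y(y_0) \int_{-\infty}^{\infty} x \int_{y_0}^{y_0+\delta} \frac{f_{XY}(x, y)}{f_Y(y_0)}\, dy\, dx = f_Y(y_0)\, \delta \int_{-\infty}^{\infty} x \frac{f_{XY}(x, y_0)}{f_Y(y_0)}\, dx + o(\delta), \quad \text{as } \delta \to 0,$$

where $\lim_{\delta \to 0} o(\delta)/\delta = 0$. The right-hand side of (1.7.13) is

$$\int_{y_0}^{y_0+\delta} E\{X \mid Y = y\} f_Y(y)\, dy = E\{X \mid Y = y_0\} f_Y(y_0)\, \delta + o(\delta), \quad \text{as } \delta \to 0.$$

Dividing both sides of (1.7.13) by $f_Y(y_0)\delta$ and letting $\delta \to 0$, we obtain that

$$E\{X \mid Y = y_0\} = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y_0)\, dx = \int_{-\infty}^{\infty} x \frac{f_{XY}(x, y_0)}{f_Y(y_0)}\, dx.$$

We therefore define for $f_Y(y_0) > 0$

$$f_{X|Y}(x \mid y_0) = \frac{f_{XY}(x, y_0)}{f_Y(y_0)}. \tag{1.7.14}$$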


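More generally, for $k \ge 2$ let $f(x_1, \ldots, x_k)$ denote the joint p.d.f. of $(X_1, \ldots, X_k)$.
Let $1 \le r < k$ and $g(x_1, \ldots, x_r)$ denote the marginal joint p.d.f. of $(X_1, \ldots, X_r)$.
Suppose that $(\xi_1, \ldots, \xi_r)$ is a point at which $g(\xi_1, \ldots, \xi_r) > 0$. The conditional
p.d.f. of $X_{r+1}, \ldots, X_k$ given $\{X_1 = \xi_1, \ldots, X_r = \xi_r\}$ is defined as

$$f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r) = \frac{f(\xi_1, \ldots, \xi_r, x_{r+1}, \ldots, x_k)}{g(\xi_1, \ldots, \xi_r)}. \tag{1.7.15}$$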


We remark that conditional distribution functions are not defined on points
$(\xi_1, \ldots, \xi_r)$ such that $g(\xi_1, \ldots, \xi_r) = 0$. However, it is easy to verify that the probability
associated with this set of points is zero. Thus, the definition presented here is
sufficiently general for statistical purposes. Notice that $f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r)$
is, for a fixed point $(\xi_1, \ldots, \xi_r)$ at which it is well defined, a nonnegative function of
$(x_{r+1}, \ldots, x_k)$ and that

$$\int \cdots \int f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r)\, dx_{r+1} \cdots dx_k = 1.$$

Thus, $f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r)$ is indeed a joint p.d.f. of $(X_{r+1}, \ldots, X_k)$. The
point $(\xi_1, \ldots, \xi_r)$ can be considered a parameter of the conditional distribution.

If $\psi(X_{r+1}, \ldots, X_k)$ is an (integrable) function of $(X_{r+1}, \ldots, X_k)$, the conditional
expectation of $\psi(X_{r+1}, \ldots, X_k)$ given $\{X_1 = \xi_1, \ldots, X_r = \xi_r\}$ is

$$E\{\psi(X_{r+1}, \ldots, X_k) \mid \xi_1, \ldots, \xi_r\} = \int \psi(x_{r+1}, \ldots, x_k)\, dF(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r). \tag{1.7.16}$$

This conditional expectation exists if the integral is absolutely convergent.
1.7.3 Independence


Random variables $X_1, \ldots, X_n$, on the same probability space, are called mutually
independent if, for any Borel sets $B_1, \ldots, B_n$,

$$P\{\omega : X_1(\omega) \in B_1, \ldots, X_n(\omega) \in B_n\} = \prod_{j=1}^{n} P\{\omega : X_j(\omega) \in B_j\}. \tag{1.7.17}$$

Accordingly, the joint distribution function of any $k$-tuple $(X_{i_1}, \ldots, X_{i_k})$ is a product
of their marginal distributions. In particular,

$$F_{X_1 \cdots X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i). \tag{1.7.18}$$

Equation (1.7.18) implies that if $X_1, \ldots, X_n$ have a joint p.d.f. $f_X(x_1, \ldots, x_n)$ and if
they are independent, then

$$f_X(x_1, \ldots, x_n) = \prod_{j=1}^{n} f_{X_j}(x_j). \tag{1.7.19}$$

Moreover, if $g(X_1, \ldots, X_n) = \prod_{j=1}^{n} g_j(X_j)$, where $g(x_1, \ldots, x_n)$ is $\mathcal{B}^{(n)}$-measurable
and $g_j(x)$ are $\mathcal{B}$-measurable, then under independence

$$E\{g(X_1, \ldots, X_n)\} = \prod_{j=1}^{n} E\{g_j(X_j)\}. \tag{1.7.20}$$

Probability models with independence structure play an important role in statistical
theory. From (1.7.12) and (1.7.21), we infer that if $X^{(1)} = (X_1, \ldots, X_k)$ and
$X^{(2)} = (X_{k+1}, \ldots, X_n)$ are independent subvectors, then the conditional distribution of $X^{(1)}$
given $X^{(2)}$ is independent of $X^{(2)}$, i.e.,

$$f(x_1, \ldots, x_k \mid x_{k+1}, \ldots, x_n) = f(x_1, \ldots, x_k) \tag{1.7.21}$$

with probability one.


1.8 MOMENTS AND RELATED FUNCTIONALS

A moment of order $r$, $r = 1, 2, \ldots$, of a distribution $F(x)$ is

$$\mu_r = E\{X^r\}. \tag{1.8.1}$$
The moments of $Y = X - \mu_1$ are called central moments and those of $|X|$ are called
absolute moments. It is simple to prove that the existence of an absolute moment
of order $r$, $r > 0$, implies the existence of all moments of order $s$, $0 < s \le r$ (see
Section 1.13.3).

Let $\mu_r^* = E\{(X - \mu_1)^r\}$, $r = 1, 2, \ldots$, denote the $r$th central moment of a distribution.
From the binomial expansion and the linear properties of the expectation
operator we obtain the relationship between moments (about the origin) $\mu_r$ and central
moments $\mu_r^*$:

$$\mu_r^* = \sum_{j=0}^{r} (-1)^j \binom{r}{j} \mu_1^j \mu_{r-j}, \tag{1.8.2}$$

where $\mu_0 = 1$.
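
The conversion from moments about the origin to central moments is mechanical and easy to sketch in code. The example below assumes the binomial-expansion relation (1.8.2) in the form mu*_r = sum_j (-1)^j C(r, j) mu_1^j mu_{r-j} and uses, purely for illustration, the exponential distribution with mean 1, whose raw moments are mu_r = r!. The function name is hypothetical.

```python
from math import comb

def central_from_raw(raw):
    """Central moments mu*_0..mu*_R from raw moments raw[0..R], with raw[0] = 1."""
    mu1 = raw[1]
    return [
        sum((-1) ** j * comb(r, j) * mu1 ** j * raw[r - j] for j in range(r + 1))
        for r in range(len(raw))
    ]

# Raw moments of the exponential distribution with mean 1: mu_r = r!
raw = [1, 1, 2, 6, 24]
print(central_from_raw(raw))   # -> [1, 0, 1, 2, 9]
```

The first central moment is always zero, and the second entry of interest, mu*_2 = 1, is the variance of this distribution.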
A distribution function $F$ is called symmetric about a point $\xi_0$ if its p.d.f. is
symmetric about $\xi_0$, i.e.,

$$f(\xi_0 + h) = f(\xi_0 - h), \quad \text{all } 0 \le h < \infty.$$

From this definition we immediately obtain the following results.

(i) If $F$ is symmetric about $\xi_0$ and $E\{|X|\} < \infty$, then $\xi_0 = E\{X\}$.
(ii) If $F$ is symmetric, then all central moments of odd order are zero, i.e.,
$E\{(X - E\{X\})^{2m+1}\} = 0$, $m = 0, 1, \ldots$, provided $E|X|^{2m+1} < \infty$.


The central moment of the second order occupies a central role in the theory
of statistics and is called the variance of $X$. The variance is denoted by $V\{X\}$. The
square root of the variance, called the standard deviation, is a measure of dispersion
around the expected value. We denote the standard deviation by $\sigma$. The variance of
$X$ is equal to

$$V\{X\} = E\{X^2\} - (E\{X\})^2. \tag{1.8.3}$$

The variance is always nonnegative, and hence for every distribution having a finite
second moment $E\{X^2\} \ge (E\{X\})^2$. One can easily verify from the definition that if
$X$ is a random variable and $a$ and $b$ are constants, then $V\{a + bX\} = b^2 V\{X\}$.

The variance is equal to zero if and only if the distribution function is concentrated
at one point (a degenerate distribution).

A famous inequality, called the Chebychev inequality, relates the probability of
$X$ concentrating around its mean, and the standard deviation $\sigma$.

Theorem 1.8.1 (Chebychev). If $F_X$ has a finite standard deviation $\sigma$, then, for every
$a > 0$,

$$P\{\omega : |X(\omega) - \mu| \le a\} \ge 1 - \frac{\sigma^2}{a^2}, \tag{1.8.4}$$

where $\mu = E\{X\}$.
Proof.

$$\sigma^2 = \int (x - \mu)^2\, dF_X(x) = \int_{\{|x - \mu| \le a\}} (x - \mu)^2\, dF_X(x) + \int_{\{|x - \mu| > a\}} (x - \mu)^2\, dF_X(x) \ge a^2 P\{\omega : |X(\omega) - \mu| > a\}. \tag{1.8.5}$$

Hence,

$$P\{\omega : |X(\omega) - \mu| \le a\} = 1 - P\{\omega : |X(\omega) - \mu| > a\} \ge 1 - \frac{\sigma^2}{a^2}.$$

QED


Notice that in the proof of the theorem, we used the Riemann-Stieltjes integral. The
theorem is true for any type of distribution for which $0 \le \sigma < \infty$. The Chebychev
inequality is a crude inequality. Various types of better inequalities are available,
under additional assumptions (see Zelen and Severo, 1968; Rohatgi, 1976, p. 102).
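
How crude the Chebychev bound can be is easy to see numerically. The sketch below assumes, for illustration only, that X is exponential with mean 1 (so that mu = sigma = 1) and compares the exact probability of |X - mu| <= a with the lower bound 1 - sigma^2 / a^2. The function names are hypothetical.

```python
import math

def exp_cdf(x):
    """c.d.f. of the exponential distribution with mean 1."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def exact_prob(a):
    """P{|X - 1| <= a} for X exponential with mean 1 (mu = sigma = 1)."""
    return exp_cdf(1.0 + a) - exp_cdf(1.0 - a)

def chebychev_bound(a, sigma=1.0):
    """Lower bound 1 - sigma^2 / a^2 on P{|X - mu| <= a}."""
    return 1.0 - sigma ** 2 / a ** 2

for a in (1.5, 2.0, 3.0):
    print(a, exact_prob(a), chebychev_bound(a))
```

For each value of a the exact probability should exceed the bound by a clear margin, which is the sense in which the inequality is crude.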

The moment generating function (m.g.f.) of a random variable $X$, denoted by
$M$, is defined as

$$M(t) = E\{\exp(tX)\}, \tag{1.8.6}$$

where $t$ is such that $M(t) < \infty$. Obviously, at $t = 0$, $M(0) = 1$. However, $M(t)$
may not exist when $t \ne 0$. Assume that $M(t)$ exists for all $t$ in some interval $(a, b)$,
$a < 0 < b$. There is a one-to-one correspondence between the distribution function
$F$ and the moment generating function $M$. $M$ is analytic on $(a, b)$, and can be
differentiated under the expectation integral. Thus

$$\frac{d^r}{dt^r} M(t) = E\{X^r \exp\{tX\}\}, \quad r = 1, 2, \ldots. \tag{1.8.7}$$

Under this assumption the $r$th derivative of $M(t)$ evaluated at $t = 0$ yields the
moment of order $r$.
To overcome the problem of $M$ being undefined in certain cases, it is useful to use
the characteristic function

$$\phi(t) = E\{e^{itX}\}, \tag{1.8.8}$$

where $i = \sqrt{-1}$. The characteristic function exists for all $t$ since

$$|\phi(t)| = \left| \int_{-\infty}^{\infty} e^{itx}\, dF(x) \right| \le \int_{-\infty}^{\infty} |e^{itx}|\, dF(x) = 1. \tag{1.8.9}$$

Indeed, $|e^{itx}| = 1$ for all $x$ and all $t$.
If $X$ assumes nonnegative integer values, it is often useful to use the probability
generating function (p.g.f.)

$$G(t) = \sum_{j=0}^{\infty} t^j p_j, \tag{1.8.10}$$

where $p_j = P\{X = j\}$, which is convergent if $|t| < 1$. Moreover, given a p.g.f. of a nonnegative integer valued
random variable $X$, its p.d.f. can be obtained by the formula

$$P\{\omega : X(\omega) = k\} = \frac{1}{k!} \frac{d^k}{dt^k} G(t) \Big|_{t=0}. \tag{1.8.11}$$
The logarithm of the moment generating function is called the cumulants generating
function. We denote this generating function by $K$. $K$ exists for all $t$ for which $M$
is finite. Both $M$ and $K$ are analytic functions in the interior of their domains of
convergence. Thus we can write for $t$ close to zero

$$K(t) = \log M(t) = \sum_{j=0}^{\infty} \frac{\kappa_j}{j!} t^j. \tag{1.8.12}$$

The coefficients $\{\kappa_j\}$ are called cumulants. Notice that $\kappa_0 = 0$, and $\kappa_j$, $j \ge 1$, can be
obtained by differentiating $K(t)$ $j$ times, and setting $t = 0$. Generally, the relationships
between the cumulants and the moments of a distribution are, for $j = 1, \ldots, 4$,

$$\begin{aligned} \kappa_1 &= \mu_1 \\ \kappa_2 &= \mu_2 - \mu_1^2 = \mu_2^* \\ \kappa_3 &= \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3 = \mu_3^* \\ \kappa_4 &= \mu_4^* - 3(\mu_2^*)^2. \end{aligned} \tag{1.8.13}$$

The following two indices

$$\beta_1 = \frac{\mu_3^*}{\sigma^3} \tag{1.8.14}$$

and

$$\beta_2 = \frac{\mu_4^*}{\sigma^4}, \tag{1.8.15}$$

where $\sigma^2 = \mu_2^*$ is the variance, are called coefficients of skewness (asymmetry) and
kurtosis (steepness), respectively. If the distribution is symmetric, then $\beta_1 = 0$. If
$\beta_1 > 0$ we say that the distribution is positively skewed; if $\beta_1 < 0$, it is negatively
skewed. If $\beta_2 > 3$ we say that the distribution is steep, and if $\beta_2 < 3$ we say that the
distribution is flat.
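
The relations between cumulants, moments, and the two shape indices can be sketched in a few lines. The code below assumes the standard identities kappa_2 = mu*_2, kappa_3 = mu*_3, kappa_4 = mu*_4 - 3 (mu*_2)^2, and again uses the exponential distribution with mean 1 (raw moments 1, 2, 6, 24) as an illustrative input. The function names are hypothetical.

```python
def cumulants_from_raw(mu1, mu2, mu3, mu4):
    """First four cumulants computed from the first four raw moments."""
    c2 = mu2 - mu1 ** 2                                            # mu*_2
    c3 = mu3 - 3 * mu2 * mu1 + 2 * mu1 ** 3                        # mu*_3
    c4 = mu4 - 4 * mu3 * mu1 + 6 * mu2 * mu1 ** 2 - 3 * mu1 ** 4   # mu*_4
    return mu1, c2, c3, c4 - 3 * c2 ** 2

def skewness_kurtosis(mu1, mu2, mu3, mu4):
    """beta_1 = mu*_3 / sigma^3 and beta_2 = mu*_4 / sigma^4."""
    _, k2, k3, k4 = cumulants_from_raw(mu1, mu2, mu3, mu4)
    central4 = k4 + 3 * k2 ** 2        # recover mu*_4 from kappa_4
    return k3 / k2 ** 1.5, central4 / k2 ** 2

print(cumulants_from_raw(1, 2, 6, 24))    # -> (1, 1, 2, 6)
print(skewness_kurtosis(1, 2, 6, 24))     # -> (2.0, 9.0)
```

With beta_1 = 2 > 0 and beta_2 = 9 > 3, this distribution is positively skewed and steep in the terminology above.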

The following equation is called the law of total variance.

If $E\{X^2\} < \infty$ then

$$V\{X\} = E\{V\{X \mid Y\}\} + V\{E\{X \mid Y\}\}, \tag{1.8.16}$$

where $V\{X \mid Y\}$ denotes the conditional variance of $X$ given $Y$.

It is often the case that it is easier to find the conditional mean and variance,
$E\{X \mid Y\}$ and $V\{X \mid Y\}$, than to find $E\{X\}$ and $V\{X\}$ directly. In such cases, formula
(1.8.16) becomes very handy.
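
A small exact computation illustrates the law of total variance. The sketch assumes a hypothetical joint p.m.f. of (X, Y) on a six-point grid (the probabilities are invented for the example) and checks that V{X} equals the mean of the conditional variances plus the variance of the conditional means.

```python
from fractions import Fraction as Fr

# Hypothetical joint p.m.f. of (X, Y); keys are (x, y).
joint = {
    (0, 0): Fr(1, 8), (1, 0): Fr(1, 8), (2, 0): Fr(1, 4),
    (0, 1): Fr(1, 4), (1, 1): Fr(1, 8), (2, 1): Fr(1, 8),
}

def mean_var(pmf):
    """Mean and variance of a p.m.f. given as {value: probability}."""
    m = sum(v * p for v, p in pmf.items())
    return m, sum((v - m) ** 2 * p for v, p in pmf.items())

# Marginal p.m.f.s of X and Y.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# Conditional mean and variance of X given Y = y0, for each y0.
cond = {}
for y0, py in p_y.items():
    cond_pmf = {x: p / py for (x, y), p in joint.items() if y == y0}
    cond[y0] = mean_var(cond_pmf)

total_var = mean_var(p_x)[1]                                   # V{X}
mean_of_cond_var = sum(cond[y][1] * p_y[y] for y in p_y)       # E{V{X | Y}}
m = sum(cond[y][0] * p_y[y] for y in p_y)                      # E{E{X | Y}}
var_of_cond_mean = sum((cond[y][0] - m) ** 2 * p_y[y] for y in p_y)  # V{E{X | Y}}

print(total_var, mean_of_cond_var, var_of_cond_mean)
```

The first printed value should equal the sum of the other two.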

The product central moment of two variables $(X, Y)$ is called the covariance and
denoted by $\text{cov}(X, Y)$. More specifically

$$\text{cov}(X, Y) = E\{[X - E\{X\}][Y - E\{Y\}]\} = E\{XY\} - E\{X\}E\{Y\}. \tag{1.8.17}$$


Notice that $\text{cov}(X, Y) = \text{cov}(Y, X)$, and $\text{cov}(X, X) = V\{X\}$. Notice that if $X$ is a
random variable having a finite first moment and $a$ is any finite constant, then
$\text{cov}(a, X) = 0$. Furthermore, whenever the second moments of $X$ and $Y$ exist the
covariance exists. This follows from the Schwarz inequality (see Section 1.13.3),
i.e., if $F$ is the joint distribution of $(X, Y)$ and $F_X$, $F_Y$ are the marginal distributions
of $X$ and $Y$, respectively, then

$$\left( \int g(x) h(y)\, dF(x, y) \right)^2 \le \left( \int g^2(x)\, dF_X(x) \right) \left( \int h^2(y)\, dF_Y(y) \right) \tag{1.8.18}$$

whenever $E\{g^2(X)\}$ and $E\{h^2(Y)\}$ are finite. In particular, for any two random variables
having second moments

$$\text{cov}^2(X, Y) \le V\{X\} V\{Y\}.$$

The ratio

$$\rho = \frac{\text{cov}(X, Y)}{\sqrt{V\{X\}} \sqrt{V\{Y\}}} \tag{1.8.19}$$

is called the coefficient of correlation (Pearson's product moment correlation). From
(1.8.18) we deduce that $-1 \le \rho \le 1$. The sign of $\rho$ is that of $\text{cov}(X, Y)$.
The m.g.f. of a multivariate distribution is a function of $k$ variables

$$M(t_1, \ldots, t_k) = E\left\{ \exp\left\{ \sum_{i=1}^{k} t_i X_i \right\} \right\}. \tag{1.8.20}$$
Let $X_1, \ldots, X_k$ be random variables having a joint distribution. Consider the linear
transformation $Y = \sum_{j=1}^{k} \beta_j X_j$, where $\beta_1, \ldots, \beta_k$ are constants. Some formulae for the
moments and covariances of such linear functions are developed here. Assume that
all the moments under consideration exist. Starting with the expected value of $Y$ we
prove:

$$E\left\{ \sum_{i=1}^{k} \beta_i X_i \right\} = \sum_{i=1}^{k} \beta_i E\{X_i\}. \tag{1.8.21}$$

This result is a direct implication of the definition of the integral as a linear operator.

Let $X$ denote a random vector in a column form and $X'$ its transpose. The expected
value of a random vector $X' = (X_1, \ldots, X_k)$ is defined as the corresponding vector
of expected values, i.e.,

$$E\{X'\} = (E\{X_1\}, \ldots, E\{X_k\}). \tag{1.8.22}$$


Furthermore, let $\Sigma$ denote a $k \times k$ matrix with elements that are the variances and
covariances of the components of $X$. In symbols

$$\Sigma = (\sigma_{ij};\ i, j = 1, \ldots, k), \tag{1.8.23}$$

where $\sigma_{ij} = \text{cov}(X_i, X_j)$, $\sigma_{ii} = V\{X_i\}$. If $Y = \beta' X$ where $\beta$ is a vector of constants,
then

$$V\{Y\} = \beta' \Sigma \beta = \sum_{i} \sum_{j} \beta_i \beta_j \sigma_{ij} = \sum_{i=1}^{k} \beta_i^2 \sigma_{ii} + \sum\sum_{i \ne j} \beta_i \beta_j \sigma_{ij}. \tag{1.8.24}$$

The result given by (1.8.24) can be generalized in the following manner. Let $Y_1 = \beta' X$
and $Y_2 = \alpha' X$, where $\alpha$ and $\beta$ are arbitrary constant vectors. Then

$$\text{cov}(Y_1, Y_2) = \alpha' \Sigma \beta. \tag{1.8.25}$$

Finally, if $X$ is a $k$-dimensional random vector with covariance matrix $\Sigma$ and $Y$ is an
$m$-dimensional vector $Y = AX$, where $A$ is an $m \times k$ matrix of constants, then the
covariance matrix of $Y$ is

$$V[Y] = A \Sigma A'. \tag{1.8.26}$$
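
The matrix formula for the covariance of a linear transformation can be illustrated with plain nested lists. The sketch below assumes a hypothetical 3 x 3 covariance matrix for X and a 2 x 3 matrix A (both invented for the example) and computes A Sigma A'; the helper names are not from the text.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def cov_of_linear_map(a, sigma):
    """Covariance matrix of Y = A X, computed as A Sigma A'."""
    return matmul(matmul(a, sigma), transpose(a))

# Hypothetical covariance matrix of X = (X_1, X_2, X_3)'.
sigma = [[4, 1, 0],
         [1, 2, 1],
         [0, 1, 3]]
# Rows of A: Y_1 = X_1 + X_2 + X_3 and Y_2 = X_1 - X_3.
a = [[1, 1, 1],
     [1, 0, -1]]

print(cov_of_linear_map(a, sigma))   # -> [[13, 1], [1, 7]]
```

The diagonal entries agree with (1.8.24) applied to each row of A: V{Y_1} is the sum of all entries of Sigma, and V{Y_2} = 4 + 3 - 2(0) = 7.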
In addition, if the covariance matrix of $X$ is $\Sigma$, then the covariance matrix of
$Y = \xi + AX$ is again $V[Y] = A \Sigma A'$, where $\xi$ is a vector of constants, and $A$ is a matrix of constants. Finally,
if $Y = AX$ and $Z = BX$, where $A$ and $B$ are matrices of constants with compatible
dimensions, then the covariance matrix of $Y$ and $Z$ is

$$C[Y, Z] = A \Sigma B'. \tag{1.8.27}$$

We conclude this section with an important theorem concerning a characteristic
function. Recall that $\phi$ is generally a complex valued function on $\mathbb{R}$, i.e.,

$$\phi(t) = \int_{-\infty}^{\infty} \cos(tx)\, dF(x) + i \int_{-\infty}^{\infty} \sin(tx)\, dF(x).$$


Theorem 1.8.2. A characteristic function $\phi$, of a distribution function $F$, has the
following properties.

(i) $|\phi(t)| \le \phi(0) = 1$;

(ii) $\phi(t)$ is a uniformly continuous function of $t$, on $\mathbb{R}$;

(iii) $\overline{\phi(t)} = \phi(-t)$, where $\bar{z}$ denotes the complex conjugate of $z$;
(iv) $\phi(t)$ is real valued if and only if $F$ is symmetric around $x_0 = 0$;
(v) if $E\{|X|^n\} < \infty$ for some $n \ge 1$, then the $r$th order derivative $\phi^{(r)}(t)$ exists for
every $1 \le r \le n$, and

$$\phi^{(r)}(t) = \int_{-\infty}^{\infty} (ix)^r e^{itx}\, dF(x), \tag{1.8.28}$$

$$\mu_r = \frac{1}{i^r} \phi^{(r)}(0), \tag{1.8.29}$$

and

$$\phi(t) = \sum_{j=0}^{n} \frac{(it)^j}{j!} \mu_j + \frac{(it)^n}{n!} R_n(t), \tag{1.8.30}$$

where $|R_n(t)| \le 3 E\{|X|^n\}$, $R_n(t) \to 0$ as $t \to 0$;
(vi) if $\phi^{(2n)}(0)$ exists and is finite, then $\mu_{2n} < \infty$;
(vii) if $E\{|X|^n\} < \infty$ for all $n \ge 1$ and

$$\limsup_{n \to \infty} \frac{(E\{|X|^n\})^{1/n}}{n} = \frac{1}{eR} < \infty, \tag{1.8.31}$$

then

$$\phi(t) = \sum_{n=0}^{\infty} \frac{(it)^n}{n!} \mu_n, \quad |t| < R. \tag{1.8.32}$$


Proof. The proof of (i) and (ii) is based on the fact that $|e^{itx}| = 1$ for all $t$ and all $x$.
Now, $\int e^{-itx}\, dF(x) = \phi(-t) = \overline{\phi(t)}$. Hence (iii) is proven.
(iv) Suppose $F(x)$ is symmetric around $x_0 = 0$. Then $dF(x) = dF(-x)$ for all $x$.
Therefore, since $\sin(-tx) = -\sin(tx)$ for all $x$, $\int_{-\infty}^{\infty} \sin(tx)\, dF(x) = 0$, and $\phi(t)$ is
real. If $\phi(t)$ is real, $\phi(t) = \overline{\phi(t)}$. Hence $\phi_X(t) = \phi_{-X}(t)$. Thus, by the one-to-one
correspondence between $\phi$ and $F$, for any Borel set $B$, $P\{X \in B\} = P\{-X \in B\} =
P\{X \in -B\}$. This implies that $F$ is symmetric about the origin.

(v) If $E\{|X|^n\} < \infty$, then $E\{|X|^r\} < \infty$ for all $1 \le r \le n$. Consider

$$\frac{\phi(t + h) - \phi(t)}{h} = E\left\{ e^{itX} \left( \frac{e^{ihX} - 1}{h} \right) \right\}.$$

Since $\left| \frac{e^{ihx} - 1}{h} \right| \le |x|$, and $E\{|X|\} < \infty$, we obtain from the Dominated Convergence
Theorem that

$$\phi^{(1)}(t) = \lim_{h \to 0} \frac{\phi(t + h) - \phi(t)}{h} = E\left\{ e^{itX} \lim_{h \to 0} \frac{e^{ihX} - 1}{h} \right\} = i E\{X e^{itX}\}.$$

Hence $\mu_1 = \frac{1}{i} \phi^{(1)}(0)$.
Equations (1.8.28)-(1.8.29) follow by induction. Taylor expansion of $e^{iy}$ yields

$$e^{iy} = \sum_{k=0}^{n-1} \frac{(iy)^k}{k!} + \frac{(iy)^n}{n!} (\cos(\theta_1 y) + i \sin(\theta_2 y)),$$

where $|\theta_1| \le 1$ and $|\theta_2| \le 1$. Hence

$$\phi(t) = E\{e^{itX}\} = \sum_{k=0}^{n-1} \frac{(it)^k}{k!} \mu_k + \frac{(it)^n}{n!} (\mu_n + R_n(t)),$$

where

$$R_n(t) = E\{X^n (\cos(\theta_1 tX) + i \sin(\theta_2 tX) - 1)\}.$$

Since $|\cos(ty)| \le 1$, $|\sin(ty)| \le 1$, evidently $|R_n(t)| \le 3 E\{|X|^n\}$. Also, by dominated
convergence, $\lim_{t \to 0} R_n(t) = 0$.
to 


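(vi) By induction on $n$. Suppose $\phi^{(2)}(0)$ exists. By L'Hospital's rule,

$$\phi^{(2)}(0) = \lim_{h \to 0} \frac{1}{2} \left[ \frac{\phi'(2h) - \phi'(0)}{2h} + \frac{\phi'(0) - \phi'(-2h)}{2h} \right]$$

$$= \lim_{h \to 0} \frac{1}{4h^2} [\phi(2h) - 2\phi(0) + \phi(-2h)]$$

$$= \lim_{h \to 0} \int \left( \frac{e^{ihx} - e^{-ihx}}{2h} \right)^2 dF(x)$$

$$= -\lim_{h \to 0} \int \left( \frac{\sin(hx)}{hx} \right)^2 x^2\, dF(x).$$

By Fatou's Lemma,

$$\phi^{(2)}(0) \le -\int \lim_{h \to 0} \left( \frac{\sin(hx)}{hx} \right)^2 x^2\, dF(x) = -\int x^2\, dF(x) = -\mu_2.$$

Thus, $\mu_2 \le -\phi^{(2)}(0) < \infty$. Assume that $0 < \mu_{2k} < \infty$. Then, by (v),

$$\phi^{(2k)}(t) = \int (ix)^{2k} e^{itx}\, dF(x) = (-1)^k \int e^{itx}\, dG(x),$$

where $dG(x) = x^{2k}\, dF(x)$, or

$$G(x) = \int_{-\infty}^{x} u^{2k}\, dF(u).$$

Notice that $G(\infty) = \mu_{2k} < \infty$. Thus, $\frac{(-1)^k \phi^{(2k)}(t)}{G(\infty)}$ is the characteristic function of
the distribution $G(x)/G(\infty)$. Since $G(\infty) > 0$, the first step of the proof, applied to this distribution, yields

$$\int x^{2k+2}\, dF(x) = \int x^2\, dG(x) < \infty.$$

This proves that $\mu_{2k} < \infty$ for all $k = 1, \ldots, n$.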
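(vii) Assuming (1.8.31), if $0 < t_0 < R$,

$$\limsup_{n \to \infty} \frac{(E\{|X|^n\})^{1/n}}{n} < \frac{1}{e\, t_0}.$$

Therefore,

$$\limsup_{n \to \infty} \frac{(E\{|X|^n\} t_0^n)^{1/n}}{n} < \frac{1}{e}.$$

By Stirling's approximation, $\lim_{n \to \infty} (n!)^{1/n}/n = 1/e$. Thus, for $0 < t_0 < R$,

$$\limsup_{n \to \infty} \left( \frac{E\{|X|^n\} t_0^n}{n!} \right)^{1/n} < 1.$$

Accordingly, by Cauchy's test, $\sum_{n=1}^{\infty} \frac{E\{|X|^n\} t_0^n}{n!} < \infty$. By (v), for any $n \ge 1$, for any
$t$, $|t| \le t_0$,

$$\phi(t) = \sum_{k=0}^{n} \frac{(it)^k}{k!} \mu_k + R_n^*(t),$$

where $|R_n^*(t)| \le 3 \frac{|t|^n}{n!} E\{|X|^n\}$. Thus, for every $t$, $|t| < R$, $\lim_{n \to \infty} |R_n^*(t)| = 0$, which
implies that

$$\phi(t) = \sum_{k=0}^{\infty} \frac{(it)^k}{k!} \mu_k, \quad \text{for all } |t| < R.$$

QED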


1.9 MODES OF CONVERGENCE


In this section we formulate many definitions and results in terms of random vectors
$X = (X_1, X_2, \ldots, X_k)'$, $1 \le k < \infty$. The notation $\|X\|$ is used for the Euclidean
norm, i.e., $\|x\|^2 = \sum_{i=1}^{k} x_i^2$.

We discuss here four modes of convergence of sequences of random vectors to a
random vector.

(i) Convergence in distribution, $X_n \xrightarrow{d} X$;
(ii) Convergence in probability, $X_n \xrightarrow{p} X$;
(iii) Convergence in $r$th mean, $X_n \xrightarrow{r} X$; and
(iv) Convergence almost surely, $X_n \xrightarrow{a.s.} X$.
A sequence $\{X_n\}$ is said to converge in distribution to $X$, $X_n \xrightarrow{d} X$, if the corresponding
distribution functions $F_n$ and $F$ satisfy

$$\lim_{n \to \infty} \int g(x)\, dF_n(x) = \int g(x)\, dF(x) \tag{1.9.1}$$

for every continuous bounded function $g$ on $\mathbb{R}^k$.
One can show that this definition is equivalent to the following statement.

A sequence $\{X_n\}$ converges in distribution to $X$, $X_n \xrightarrow{d} X$, if $\lim_{n \to \infty} F_n(x) = F(x)$
at all continuity points $x$ of $F$.

If $X_n \xrightarrow{d} X$ we say that $F_n$ converges to $F$ weakly. The notation is $F_n \xrightarrow{w} F$
or $F_n \Rightarrow F$.
We define now convergence in probability.

A sequence $\{X_n\}$ converges in probability to $X$, $X_n \xrightarrow{p} X$, if, for each $\epsilon > 0$,

$$\lim_{n \to \infty} P\{\|X_n - X\| > \epsilon\} = 0. \tag{1.9.2}$$

We define now convergence in $r$th mean.

A sequence of random vectors $\{X_n\}$ converges in $r$th mean, $r > 0$, to $X$, $X_n \xrightarrow{r} X$,
if $E\{\|X_n - X\|^r\} \to 0$ as $n \to \infty$.
A fourth mode of convergence is the following.

A sequence of random vectors $\{X_n\}$ converges almost surely to $X$, $X_n \xrightarrow{a.s.} X$, as
$n \to \infty$ if

$$P\{\lim_{n \to \infty} X_n = X\} = 1. \tag{1.9.3}$$

The following is an equivalent definition.
$X_n \xrightarrow{a.s.} X$ as $n \to \infty$ if and only if, for any $\epsilon > 0$,

$$\lim_{n \to \infty} P\{\|X_m - X\| < \epsilon,\ \forall m \ge n\} = 1. \tag{1.9.4}$$

Equation (1.9.4) is equivalent to

$$P\{\limsup_{n \to \infty} \|X_n - X\| < \epsilon\} = 1.$$

But,

$$P\{\limsup_{n \to \infty} \|X_n - X\| < \epsilon\} = 1 - P\{\limsup_{n \to \infty} \|X_n - X\| \ge \epsilon\} = 1 - P\{\|X_n - X\| \ge \epsilon, \text{ i.o.}\}.$$
By the Borel-Cantelli Lemma (Theorem 1.4.1), a sufficient condition for
$X_n \xrightarrow{a.s.} X$ is

$$\sum_{n=1}^{\infty} P\{\|X_n - X\| \ge \epsilon\} < \infty \tag{1.9.5}$$

for all $\epsilon > 0$.

Theorem 1.9.1. Let $\{X_n\}$ be a sequence of random vectors. Then

(a) $X_n \xrightarrow{a.s.} X$ implies $X_n \xrightarrow{p} X$.
(b) $X_n \xrightarrow{r} X$, $r > 0$, implies $X_n \xrightarrow{p} X$.
(c) $X_n \xrightarrow{p} X$ implies $X_n \xrightarrow{d} X$.


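Proof. (a) Since $X_n \xrightarrow{a.s.} X$, for any $\epsilon > 0$,

$$0 = P\{\limsup_{n \to \infty} \|X_n - X\| \ge \epsilon\} = \lim_{n \to \infty} P\left\{ \bigcup_{m \ge n} \{\|X_m - X\| \ge \epsilon\} \right\} \ge \lim_{n \to \infty} P\{\|X_n - X\| \ge \epsilon\}. \tag{1.9.6}$$

The inequality (1.9.6) implies that $X_n \xrightarrow{p} X$.
(b) It can be immediately shown that, for any $\epsilon > 0$,

$$E\{\|X_n - X\|^r\} \ge \epsilon^r P\{\|X_n - X\| \ge \epsilon\}.$$

Thus, $X_n \xrightarrow{r} X$ implies $X_n \xrightarrow{p} X$.
(c) Let $\epsilon > 0$. If $X_n \le x_0$ then either $X \le x_0 + \epsilon 1$, where $1 = (1, \ldots, 1)'$, or
$\|X_n - X\| > \epsilon$. Thus, for all $n$,

$$F_n(x_0) \le F(x_0 + \epsilon 1) + P\{\|X_n - X\| > \epsilon\}.$$

Similarly,

$$F(x_0 - \epsilon 1) \le F_n(x_0) + P\{\|X_n - X\| > \epsilon\}.$$

Finally, since $X_n \xrightarrow{p} X$,

$$F(x_0 - \epsilon 1) \le \liminf_{n \to \infty} F_n(x_0) \le \limsup_{n \to \infty} F_n(x_0) \le F(x_0 + \epsilon 1).$$

Thus, if $x_0$ is a continuity point of $F$, by letting $\epsilon \to 0$, we obtain

$$\lim_{n \to \infty} F_n(x_0) = F(x_0).$$

QED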
Theorem 1.9.2. Let $\{X_n\}$ be a sequence of random vectors. Then

(a) if $c \in \mathbb{R}^k$, then $X_n \xrightarrow{d} c$ implies $X_n \xrightarrow{p} c$;

(b) if $X_n \xrightarrow{a.s.} X$ and $\|X_n\|^r \le Z$, for some $r > 0$ and some (positive) random
variable $Z$, with $E\{Z\} < \infty$, then $X_n \xrightarrow{r} X$.

For proof, see Ferguson (1996, p. 9). Part (b) is implied also from Theorem 1.13.3.


Theorem 1.9.3. Let $\{X_n\}$ be a sequence of nonnegative random variables such that
$X_n \xrightarrow{a.s.} X$ and $E\{X_n\} \to E\{X\}$, $E\{X\} < \infty$. Then

$$E\{|X_n - X|\} \to 0, \quad \text{as } n \to \infty. \tag{1.9.7}$$

Proof. Since $E\{X_n\} \to E\{X\} < \infty$, for sufficiently large $n$, $E\{X_n\} < \infty$. For
such $n$,

$$E\{|X_n - X|\} = E\{(X - X_n) I\{X \ge X_n\}\} + E\{(X_n - X) I\{X_n > X\}\} = 2 E\{(X - X_n) I\{X \ge X_n\}\} + E\{X_n - X\}.$$

But,

$$0 \le (X - X_n) I\{X \ge X_n\} \le X.$$

Therefore, by the Lebesgue Dominated Convergence Theorem,

$$\lim_{n \to \infty} E\{(X - X_n) I\{X \ge X_n\}\} = 0.$$

This implies (1.9.7). QED
1.10 WEAK CONVERGENCE


The following theorem plays a major role in weak convergence. 


Theorem 1.10.1. The following conditions are equivalent.

(a) $X_n \xrightarrow{d} X$;

(b) $E\{g(X_n)\} \to E\{g(X)\}$, for all continuous functions $g$ that vanish outside a
compact set;

(c) $E\{g(X_n)\} \to E\{g(X)\}$, for all continuous bounded functions $g$;

(d) $E\{g(X_n)\} \to E\{g(X)\}$, for all measurable functions $g$ such that $P\{X \in
C(g)\} = 1$, where $C(g)$ is the set of all points at which $g$ is continuous.


For proof, see Ferguson (1996, pp. 14-16). 


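Theorem 1.10.2. Let $\{X_n\}$ be a sequence of random vectors in $\mathbb{R}^k$, and $X_n \xrightarrow{d} X$.
Then
(i) $f(X_n) \xrightarrow{d} f(X)$, for every function $f : \mathbb{R}^k \to \mathbb{R}^l$ such that $P\{X \in C(f)\} = 1$;
(ii) if $\{Y_n\}$ is a sequence such that $X_n - Y_n \xrightarrow{p} 0$, then $Y_n \xrightarrow{d} X$;
(iii) if $X_n \in \mathbb{R}^k$ and $Y_n \in \mathbb{R}^l$ and $Y_n \xrightarrow{p} c$, then

$$\begin{pmatrix} X_n \\ Y_n \end{pmatrix} \xrightarrow{d} \begin{pmatrix} X \\ c \end{pmatrix}.$$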

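Proof. (i) Let $g : \mathbb{R}^l \to \mathbb{R}$ be bounded and continuous. Let $h(x) = g(f(x))$. If $x$ is
a continuity point of $f$, then $x$ is a continuity point of $h$, i.e., $C(f) \subset C(h)$. Hence
$P\{X \in C(h)\} = 1$. By Theorem 1.10.1 (c), it is sufficient to show that $E\{g(f(X_n))\} \to
E\{g(f(X))\}$. Theorem 1.10.1 (d) implies, since $P\{X \in C(h)\} = 1$ and $X_n \xrightarrow{d} X$,
that $E\{h(X_n)\} \to E\{h(X)\}$.

(ii) According to Theorem 1.10.1 (b), let $g$ be a continuous function on $\mathbb{R}^k$
vanishing outside a compact set. Thus $g$ is uniformly continuous and bounded.
Let $\epsilon > 0$, find $\delta > 0$ such that, if $\|x - y\| \le \delta$ then $|g(x) - g(y)| < \epsilon$. Also, $g$ is
bounded, say $|g(x)| \le B < \infty$. Thus,

$$\begin{aligned} |E\{g(Y_n)\} - E\{g(X)\}| &\le |E\{g(Y_n)\} - E\{g(X_n)\}| + |E\{g(X_n)\} - E\{g(X)\}| \\ &\le E\{|g(Y_n) - g(X_n)| I\{\|X_n - Y_n\| \le \delta\}\} \\ &\quad + E\{|g(Y_n) - g(X_n)| I\{\|X_n - Y_n\| > \delta\}\} \\ &\quad + |E\{g(X_n)\} - E\{g(X)\}| \\ &\le \epsilon + 2B P\{\|X_n - Y_n\| > \delta\} + |E\{g(X_n)\} - E\{g(X)\}| \to \epsilon, \quad \text{as } n \to \infty. \end{aligned}$$

Since $\epsilon$ is arbitrary, $Y_n \xrightarrow{d} X$.

(iii)

$$P\left\{ \left\| \begin{pmatrix} X_n \\ Y_n \end{pmatrix} - \begin{pmatrix} X_n \\ c \end{pmatrix} \right\| > \epsilon \right\} = P\{\|Y_n - c\| > \epsilon\} \to 0, \quad \text{as } n \to \infty.$$

Hence, from part (ii), $\begin{pmatrix} X_n \\ Y_n \end{pmatrix} \xrightarrow{d} \begin{pmatrix} X \\ c \end{pmatrix}$.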

As a special case of the above theorem we get 


Theorem 1.10.3 (Slutsky's Theorem). Let $\{X_n\}$ and $\{Y_n\}$ be sequences of random
variables, $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$. Then

(i) $X_n + Y_n \xrightarrow{d} X + c$;

(ii) $X_n Y_n \xrightarrow{d} cX$; (1.10.1)

(iii) if $c \ne 0$ then $\frac{X_n}{Y_n} \xrightarrow{d} \frac{X}{c}$.

A sequence of distribution functions may not converge to a distribution function.
For example, let $X_n$ be random variables with

$$F_n(x) = \begin{cases} 0, & x < -n \\ \frac{1}{2}, & -n \le x < n \\ 1, & n \le x. \end{cases}$$

Then, $\lim_{n \to \infty} F_n(x) = \frac{1}{2}$ for all $x$. $F(x) = \frac{1}{2}$ for all $x$ is not a distribution function. In
this example, half of the probability mass escapes to $-\infty$ and half the mass escapes
to $+\infty$. In order to avoid such situations, we require collections (families) of
probability distributions to be tight.

Let $\mathcal{F} = \{F_u, u \in \mathcal{U}\}$ be a family of distribution functions on $\mathbb{R}^k$. $\mathcal{F}$ is tight if, for
any $\epsilon > 0$, there exists a compact set $C \subset \mathbb{R}^k$ such that

$$\sup_{u \in \mathcal{U}} \int I\{x \in \mathbb{R}^k - C\}\, dF_u(x) < \epsilon.$$

In the above example, the sequence $F_n(x)$ is not tight.
If $\mathcal{F}$ is tight, then every sequence of distributions of $\mathcal{F}$ contains a subsequence
converging weakly to a distribution function (see Shiryayev, 1984, p. 315).
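
The escaping-mass example can be made concrete with a few lines of code. The sketch below assumes that $F_n$ places probability 1/2 at $-n$ and 1/2 at $+n$, which is consistent with the description of half the mass escaping in each direction; the function names are hypothetical.

```python
def F_n(x, n):
    """c.d.f. with mass 1/2 at -n and mass 1/2 at +n."""
    if x < -n:
        return 0.0
    if x < n:
        return 0.5
    return 1.0

def mass_outside(n, c):
    """Probability that F_n assigns to the complement of the compact set [-c, c]."""
    return 1.0 if n > c else 0.0

# For any fixed x, F_n(x) equals 1/2 as soon as n > |x|.
print([F_n(3.0, n) for n in (1, 2, 5, 50)])

# For any fixed c, some member of the sequence puts all of its mass outside [-c, c],
# so the supremum in the tightness condition is 1 and cannot be made small.
print(max(mass_outside(n, 50.0) for n in range(1, 100)))
```

No choice of compact interval brings the supremum below 1, which is why the sequence fails to be tight.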
Theorem 1.10.4. Let $\{F_n\}$ be a tight family of distribution functions on $\mathbb{R}$. A necessary
and sufficient condition for $F_n \Rightarrow F$ is that, for each $t \in \mathbb{R}$, $\lim_{n \to \infty} \phi_n(t)$ exists,
where $\phi_n(t) = \int e^{itx}\, dF_n(x)$ is the characteristic function corresponding to $F_n$.

For proof, see Shiryayev (1984, p. 321).

Theorem 1.10.5 (Continuity Theorem). Let $\{F_n\}$ be a sequence of distribution
functions and $\{\phi_n\}$ the corresponding sequence of characteristic functions. Let $F$ be
a distribution function, with characteristic function $\phi$. Then $F_n \Rightarrow F$ if and only if
$\phi_n(t) \to \phi(t)$ for all $t \in \mathbb{R}^k$ (Shiryayev, 1984, p. 322).


1.11 LAWS OF LARGE NUMBERS 


1.11.1 The Weak Law of Large Numbers (WLLN) 


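Let $X_1, X_2, \ldots$ be a sequence of identically distributed uncorrelated random vectors.
Let $\mu = E\{X_1\}$ and let $\Sigma = E\{(X_1 - \mu)(X_1 - \mu)'\}$ be finite. Then the means
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ converge in probability to $\mu$, i.e.,

$$\bar{X}_n \xrightarrow{p} \mu \quad \text{as } n \to \infty. \tag{1.11.1}$$

The proof is simple. Since $\text{cov}(X_n, X_{n'}) = 0$ for all $n \ne n'$, the covariance matrix of
$\bar{X}_n$ is $\frac{1}{n} \Sigma$. Moreover, since $E\{\bar{X}_n\} = \mu$,

$$E\{\|\bar{X}_n - \mu\|^2\} = \frac{1}{n} \text{tr}\{\Sigma\} \to 0 \quad \text{as } n \to \infty.$$

Hence $\bar{X}_n \xrightarrow{2} \mu$, which implies that $\bar{X}_n \xrightarrow{p} \mu$. Here $\text{tr}\{\Sigma\}$ denotes the trace
of $\Sigma$.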
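
The weak law can be watched at work in the simplest scalar case. The sketch below assumes i.i.d. Bernoulli(p) variables (an illustrative choice) and computes the exact probability that the sample mean deviates from p by more than eps, alongside the second-moment bound sigma^2 / (n eps^2) that drives the proof. Exact rational comparison is used at the boundary; the function names are hypothetical.

```python
from fractions import Fraction as Fr
from math import comb

def prob_far_from_mean(n, p, eps):
    """Exact P{|Xbar_n - p| > eps} for n i.i.d. Bernoulli(p) variables.

    p and eps are Fractions so that the boundary comparison is exact.
    """
    q = float(p)
    return sum(
        comb(n, k) * q ** k * (1 - q) ** (n - k)
        for k in range(n + 1)
        if abs(Fr(k, n) - p) > eps
    )

def second_moment_bound(n, p, eps):
    """E{(Xbar_n - p)^2} / eps^2 = p (1 - p) / (n eps^2)."""
    return float(p * (1 - p) / (n * eps ** 2))

p, eps = Fr(3, 10), Fr(1, 10)
for n in (10, 100, 1000):
    print(n, prob_far_from_mean(n, p, eps), second_moment_bound(n, p, eps))
```

Both columns decrease with n; the exact probability falls much faster than the bound, which is only of order 1/n.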

If $X_1, X_2, \ldots$ are independent and identically distributed, with $E\{X_1\} = \mu$, then
the characteristic function of $\bar{X}_n$ is

$$\phi_{\bar{X}_n}(t) = \left( \phi\left( \frac{t}{n} \right) \right)^n, \tag{1.11.2}$$

where $\phi(t)$ is the characteristic function of $X_1$. Fix $t$. Then for large values of $n$,

$$\phi\left( \frac{t}{n} \right) = 1 + \frac{i}{n} t'\mu + o\left( \frac{1}{n} \right), \quad \text{as } n \to \infty.$$

Therefore,

$$\phi_{\bar{X}_n}(t) = \left( 1 + \frac{i}{n} t'\mu + o\left( \frac{1}{n} \right) \right)^n \to e^{it'\mu}. \tag{1.11.3}$$

$\phi(t) = e^{it'\mu}$ is the characteristic function of $X$, where $P\{X = \mu\} = 1$. Thus, since
$e^{it'\mu}$ is continuous at $t = 0$, $\bar{X}_n \xrightarrow{d} \mu$. This implies that $\bar{X}_n \xrightarrow{p} \mu$ (left as an
exercise).


1.11.2 The Strong Law of Large Numbers (SLLN) 


Strong laws of large numbers, for independent random variables having finite
expected values, are of the form

$$\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{a.s.} 0, \quad \text{as } n \to \infty,$$

where $\mu_i = E\{X_i\}$.


Theorem 1.11.1 (Cantelli). Let $\{X_n\}$ be a sequence of independent random variables
having uniformly bounded fourth central moments, i.e.,

$$0 \le E\{(X_n - \mu_n)^4\} \le C < \infty \tag{1.11.4}$$

for all $n \ge 1$. Then

$$\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{a.s.} 0, \quad \text{as } n \to \infty. \tag{1.11.5}$$


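Proof. Without loss of generality, we can assume that $\mu_n = E\{X_n\} = 0$ for all $n \ge 1$. Then

$$E\{\bar{X}_n^4\} = E\left\{ \left( \frac{1}{n} \sum_{i=1}^{n} X_i \right)^4 \right\} = \frac{1}{n^4} \left\{ \sum_{i=1}^{n} E\{X_i^4\} + 4 \sum\sum_{i \ne j} E\{X_i^3 X_j\} + 3 \sum\sum_{i \ne j} E\{X_i^2 X_j^2\} + 6 \sum\sum\sum_{i \ne j \ne k} E\{X_i^2 X_j X_k\} + \sum\sum\sum\sum_{i \ne j \ne k \ne l} E\{X_i X_j X_k X_l\} \right\}$$

$$= \frac{1}{n^4} \sum_{i=1}^{n} \mu_{4,i} + \frac{3}{n^4} \sum\sum_{i \ne j} \sigma_i^2 \sigma_j^2,$$

where $\mu_{4,i} = E\{X_i^4\}$ and $\sigma_i^2 = E\{X_i^2\}$. By the Schwarz inequality, $\sigma_i^2 \sigma_j^2 \le (\mu_{4,i} \cdot
\mu_{4,j})^{1/2}$ for all $i \ne j$. Hence,

$$E\{\bar{X}_n^4\} \le \frac{C}{n^3} + \frac{3C}{n^2} = O\left( \frac{1}{n^2} \right).$$

By Chebychev's inequality,

$$P\{|\bar{X}_n| \ge \epsilon\} = P\{\bar{X}_n^4 \ge \epsilon^4\} \le \frac{E\{\bar{X}_n^4\}}{\epsilon^4}.$$

Hence, for any $\epsilon > 0$,

$$\sum_{n=1}^{\infty} P\{|\bar{X}_n| \ge \epsilon\} \le C^* \sum_{n=1}^{\infty} \frac{1}{n^2} < \infty,$$

where $C^*$ is some positive finite constant. Finally, by the Borel-Cantelli Lemma
(Theorem 1.4.1),

$$P\{|\bar{X}_n| \ge \epsilon, \text{ i.o.}\} = 0.$$

Thus, $P\{|\bar{X}_n| < \epsilon, \text{ i.o.}\} = 1$. QED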


Cantelli's Theorem is quite stringent, in the sense that it requires the existence of
the fourth moments of the independent random variables. Kolmogorov relaxed
this condition and proved that, if the random variables have finite variances,
$0 < \sigma_n^2 < \infty$, and

$$\sum_{n=1}^{\infty} \frac{\sigma_n^2}{n^2} < \infty, \tag{1.11.6}$$

then $\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{a.s.} 0$ as $n \to \infty$.

If the random variables are independent and identically distributed (i.i.d.), then
Kolmogorov showed that $E\{|X_1|\} < \infty$ is sufficient for the strong law of large
numbers. To prove Kolmogorov's strong law of large numbers one has to develop
more theoretical results. We refer the reader to more advanced probability books (see
Shiryayev, 1984).
1.12 CENTRAL LIMIT THEOREM


The Central Limit Theorem (CLT) states that, under general valid conditions, the
distributions of properly normalized sample means converge weakly to the standard
normal distribution.

A continuous random variable $Z$ is said to have a standard normal distribution,
and we denote it $Z \sim N(0, 1)$, if its distribution function is absolutely continuous,
having a p.d.f.

$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}, \quad -\infty < x < \infty. \tag{1.12.1}$$

The c.d.f. of $N(0, 1)$, called the standard normal integral, is

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2} y^2}\, dy. \tag{1.12.2}$$

The general family of normal distributions is studied in Chapter 2. Here we just
mention that if $Z \sim N(0, 1)$, the moments of $Z$ are

$$\mu_r = \begin{cases} \frac{(2k)!}{2^k k!}, & \text{if } r = 2k \\ 0, & \text{if } r = 2k + 1. \end{cases} \tag{1.12.3}$$

The characteristic function of $N(0, 1)$ is

$$\phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2} x^2 + itx}\, dx = e^{-\frac{1}{2} t^2}, \quad -\infty < t < \infty. \tag{1.12.4}$$

A random vector $Z = (Z_1, \ldots, Z_k)'$ is said to have a multivariate normal distribution
with mean $\mu = E\{Z\} = 0$ and covariance matrix $V$ (see Chapter 2), $Z \sim N(0, V)$,
if the p.d.f. of $Z$ is

$$f(z; V) = \frac{1}{(2\pi)^{k/2} |V|^{1/2}} \exp\left\{ -\frac{1}{2} z' V^{-1} z \right\}.$$

The corresponding characteristic function is

$$\phi_Z(t) = \exp\left\{ -\frac{1}{2} t' V t \right\}, \tag{1.12.5}$$

$t \in \mathbb{R}^k$.
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Using the method of characteristic functions, with the continuity theorem we 
prove the following simple two versions of the CLT. A proof of the Central Limit 
Theorem, which is not based on the continuity theorem of characteristic functions, 
can be obtained by the method of Stein (1986) for approximating expected values or 
probabilities. 


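Theorem 1.12.1 (CLT). Let $\{X_n\}$ be a sequence of i.i.d. random variables having a finite positive variance, i.e., $\mu = E\{X_1\}$, $V\{X_1\} = \sigma^2$, $0 < \sigma^2 < \infty$. Then

$$\sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty. \tag{1.12.6}$$

Proof. Notice that $\sqrt{n}\,\dfrac{\bar X_n - \mu}{\sigma} = \dfrac{1}{\sqrt{n}}\sum_{i=1}^{n} Z_i$, where $Z_i = \dfrac{X_i - \mu}{\sigma}$, $i \ge 1$. Moreover, $E\{Z_i\} = 0$ and $V\{Z_i\} = 1$, $i \ge 1$. Let $\phi_Z(t)$ be the characteristic function of $Z_1$. Then, since $E\{Z\} = 0$, $V\{Z\} = 1$, (1.8.33) implies that

$$\phi_Z(t) = 1 - \frac{t^2}{2} + o(t^2), \quad \text{as } t \to 0.$$

Accordingly, since $\{Z_n\}$ are i.i.d.,

$$\phi_{\sqrt{n}\,\bar Z_n}(t) = \phi_Z^n\left(\frac{t}{\sqrt{n}}\right) = \left(1 - \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right)^n \to e^{-t^2/2}, \quad \text{as } n \to \infty.$$

Hence, $\sqrt{n}\,\bar Z_n \xrightarrow{d} N(0, 1)$. QED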


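Theorem 1.12.1 can be generalized to random vectors. Let $\bar{\mathbf{X}}_n = \dfrac{1}{n}\sum_{j=1}^{n}\mathbf{X}_j$, $n \ge 1$. The generalized CLT is the following theorem.

Theorem 1.12.2. Let $\{\mathbf{X}_n\}$ be a sequence of i.i.d. random vectors with $E\{\mathbf{X}_n\} = \mathbf{0}$ and covariance matrix $E\{\mathbf{X}_n\mathbf{X}_n'\} = V$, $n \ge 1$, where $V$ is positive definite with finite eigenvalues. Then

$$\sqrt{n}\,\bar{\mathbf{X}}_n \xrightarrow{d} N(\mathbf{0}, V). \tag{1.12.7}$$

Proof. Let $\phi_{\mathbf{X}}(\mathbf{t})$ be the characteristic function of $\mathbf{X}_1$. Then, since $E\{\mathbf{X}_1\} = \mathbf{0}$,

$$\phi_{\sqrt{n}\,\bar{\mathbf{X}}_n}(\mathbf{t}) = \phi_{\mathbf{X}}^n\left(\frac{\mathbf{t}}{\sqrt{n}}\right) = \left(1 - \frac{1}{2n}\mathbf{t}'V\mathbf{t} + o\left(\frac{1}{n}\right)\right)^n$$

as $n \to \infty$. Hence

$$\lim_{n\to\infty}\phi_{\sqrt{n}\,\bar{\mathbf{X}}_n}(\mathbf{t}) = \exp\left\{-\frac{1}{2}\mathbf{t}'V\mathbf{t}\right\}, \quad \mathbf{t} \in \mathbb{R}^k.$$

QED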


When the random variables are independent but not identically distributed, we 
need a stronger version of the CLT. The following celebrated CLT is sufficient for 
most purposes. 


Theorem 1.12.3 (Lindeberg-Feller). Consider a triangular array of random variables $\{X_{n,k}\}$, $k = 1, \ldots, n$, $n \ge 1$, such that, for each $n \ge 1$, $\{X_{n,k}, k = 1, \ldots, n\}$ are independent, with $E\{X_{n,k}\} = 0$ and $V\{X_{n,k}\} = \sigma_{n,k}^2$. Let $S_n = \sum_{k=1}^{n} X_{n,k}$ and $B_n^2 = \sum_{k=1}^{n}\sigma_{n,k}^2$. Assume that $B_n > 0$ for each $n \ge 1$, and $B_n \nearrow \infty$, as $n \to \infty$. If for every $\epsilon > 0$,

$$\frac{1}{B_n^2}\sum_{k=1}^{n} E\{X_{n,k}^2\, I\{|X_{n,k}| > \epsilon B_n\}\} \to 0 \tag{1.12.8}$$

as $n \to \infty$, then $S_n/B_n \xrightarrow{d} N(0, 1)$ as $n \to \infty$. Conversely, if $\max_{1 \le k \le n}\dfrac{\sigma_{n,k}^2}{B_n^2} \to 0$ as $n \to \infty$ and $S_n/B_n \xrightarrow{d} N(0, 1)$, then (1.12.8) holds.

For a proof, see Shiryayev (1984, p. 326). The following theorem, known as Lyapunov's Theorem, is weaker than the Lindeberg-Feller Theorem, but is often sufficient to establish the CLT.

Theorem 1.12.4 (Lyapunov). Let $\{X_n\}$ be a sequence of independent random variables. Assume that $E\{X_n\} = 0$, $V\{X_n\} > 0$ and $E\{|X_n|^3\} < \infty$, for all $n \ge 1$. Moreover, assume that $B_n^2 = \sum_{j=1}^{n} V\{X_j\} \nearrow \infty$. Under the condition

$$\frac{1}{B_n^3}\sum_{j=1}^{n} E\{|X_j|^3\} \to 0 \quad \text{as } n \to \infty, \tag{1.12.9}$$

the CLT holds, i.e., $S_n/B_n \xrightarrow{d} N(0, 1)$ as $n \to \infty$.

Proof. It is sufficient to prove that (1.12.9) implies the Lindeberg-Feller condition (1.12.8). Indeed,

$$E\{|X_j|^3\} = \int_{-\infty}^{\infty}|x|^3\,dF_j(x) \ge \int_{\{x : |x| > \epsilon B_n\}}|x|^3\,dF_j(x) \ge \epsilon B_n\int_{\{x : |x| > \epsilon B_n\}} x^2\,dF_j(x).$$

Thus,

$$\frac{1}{B_n^2}\sum_{j=1}^{n}\int_{\{x : |x| > \epsilon B_n\}} x^2\,dF_j(x) \le \frac{1}{\epsilon}\cdot\frac{1}{B_n^3}\sum_{j=1}^{n} E\{|X_j|^3\} \to 0.$$

QED


Stein (1986, p. 97) proved, using a novel approximation to expectation, that if $X_1, X_2, \ldots$ are independent and identically distributed, with $EX_1 = 0$, $EX_1^2 = 1$ and $\gamma = E\{|X_1|^3\} < \infty$, then, for all $-\infty < x < \infty$ and all $n = 1, 2, \ldots$,

$$\left|P\left\{\frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i \le x\right\} - \Phi(x)\right| \le \frac{6\gamma}{\sqrt{n}},$$

where $\Phi(x)$ is the c.d.f. of $N(0, 1)$. This immediately implies the CLT and shows that the convergence is uniform in $x$.

1.13. MISCELLANEOUS RESULTS 


In this section we review additional results. 
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1.13.1 Law of the Iterated Logarithm 
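We denote by $\log_2(x)$ the function $\log(\log(x))$, $x > e$.

Theorem 1.13.1. Let $\{X_n\}$ be a sequence of i.i.d. random variables, such that $E\{X_1\} = 0$ and $V\{X_1\} = \sigma^2$, $0 < \sigma < \infty$. Let $S_n = \sum_{i=1}^{n} X_i$. Then

$$P\left\{\varlimsup_{n\to\infty}\frac{|S_n|}{\psi(n)} = 1\right\} = 1, \tag{1.13.1}$$

where $\psi(n) = (2\sigma^2 n\log_2(n))^{1/2}$, $n \ge 3$.

For proof, in the normal case, see Shiryayev (1984, p. 372).

The theorem means that, for every $\epsilon > 0$, the sequence $|S_n|$ will cross the boundary $(1+\epsilon)\psi(n)$, $n \ge 3$, only a finite number of times, with probability 1, as $n \to \infty$. Notice that although $E\{S_n\} = 0$, $n \ge 1$, the variance of $S_n$ is $V\{S_n\} = n\sigma^2$ and $P\{\varlimsup_{n\to\infty}|S_n| = \infty\} = 1$. However, if we consider $\dfrac{S_n}{n}$ then by the SLLN, $\dfrac{S_n}{n} \xrightarrow{a.s.} 0$. If we divide only by $\sqrt{n}$ then, by the CLT, $\dfrac{S_n}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1)$. The law of the iterated logarithm says that, for every $\epsilon > 0$, $P\left\{\dfrac{|S_n|}{\sigma\sqrt{n}} > (1+\epsilon)\sqrt{2\log_2(n)}, \text{ i.o.}\right\} = 0$. This means that the fluctuations of $S_n$ are not too wild. In Example 1.22 we see that if $\{X_n\}$ are i.i.d. with $P\{X_1 = 1\} = P\{X_1 = -1\} = \frac{1}{2}$, then $\dfrac{S_n}{n} \xrightarrow{a.s.} 0$ as $n \to \infty$. But $n$ goes to infinity faster than $\sqrt{n\log_2(n)}$. Thus, by (1.13.1), if we consider the sequence $W_n = \dfrac{S_n}{\sqrt{2n\log_2(n)}}$ then $P\{|W_n| < 1 + \epsilon, \text{ i.o.}\} = 1$. $\{W_n\}$ fluctuates between $-1$ and $1$ almost always.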


1.13.2 Uniform Integrability 


A sequence of random variables $\{X_n\}$ is uniformly integrable if

$$\lim_{c\to\infty}\sup_{n \ge 1} E\{|X_n|\, I\{|X_n| > c\}\} = 0. \tag{1.13.2}$$

Clearly, if $|X_n| \le Y$ for all $n \ge 1$ and $E\{Y\} < \infty$, then $\{X_n\}$ is a uniformly integrable sequence. Indeed, $|X_n|\, I\{|X_n| > c\} \le |Y|\, I\{|Y| > c\}$ for all $n \ge 1$. Hence,

$$\sup_{n \ge 1} E\{|X_n|\, I\{|X_n| > c\}\} \le E\{|Y|\, I\{|Y| > c\}\} \to 0$$

as $c \to \infty$ since $E\{Y\} < \infty$.
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Theorem 1.13.2. Let {X,} be uniformly integrable. Then, 
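(i) $$E\left\{\varliminf_{n\to\infty} X_n\right\} \le \varliminf_{n\to\infty} E\{X_n\} \le \varlimsup_{n\to\infty} E\{X_n\} \le E\left\{\varlimsup_{n\to\infty} X_n\right\}; \tag{1.13.3}$$

(ii) if in addition $X_n \xrightarrow{a.s.} X$, as $n \to \infty$, then $X$ is integrable and

$$\lim_{n\to\infty} E\{X_n\} = E\{X\}, \tag{1.13.4}$$

$$\lim_{n\to\infty} E\{|X_n - X|\} = 0. \tag{1.13.5}$$

Proof. (i) For every $c > 0$

$$E\{X_n\} = E\{X_n I\{X_n < -c\}\} + E\{X_n I\{X_n \ge -c\}\}. \tag{1.13.6}$$

By uniform integrability, for every $\epsilon > 0$, take $c$ sufficiently large so that

$$\sup_{n \ge 1}|E\{X_n I\{X_n < -c\}\}| < \epsilon. \tag{1.13.7}$$

By Fatou's Lemma (Theorem 1.6.2),

$$\varliminf_{n\to\infty} E\{X_n I\{X_n \ge -c\}\} \ge E\left\{\varliminf_{n\to\infty} X_n I\{X_n \ge -c\}\right\}.$$

But $X_n I\{X_n \ge -c\} \ge X_n$. Therefore,

$$\varliminf_{n\to\infty} E\{X_n I\{X_n \ge -c\}\} \ge E\left\{\varliminf_{n\to\infty} X_n\right\}. \tag{1.13.8}$$

From (1.13.6)-(1.13.8), we obtain

$$\varliminf_{n\to\infty} E\{X_n\} \ge E\left\{\varliminf_{n\to\infty} X_n\right\} - \epsilon. \tag{1.13.9}$$

In a similar way, we show that

$$\varlimsup_{n\to\infty} E\{X_n\} \le E\left\{\varlimsup_{n\to\infty} X_n\right\} + \epsilon. \tag{1.13.10}$$

Since $\epsilon$ is arbitrary we obtain (1.13.3). Part (ii) is obtained from (i) as in the Dominated Convergence Theorem (Theorem 1.6.3).

QED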


Theorem 1.13.3. Let $X_n \ge 0$, $n \ge 1$, and $X_n \xrightarrow{a.s.} X$, $E\{X_n\} < \infty$. Then $E\{X_n\} \to E\{X\}$ if and only if $\{X_n\}$ is uniformly integrable.

Proof. The sufficiency follows from part (ii) of the previous theorem.
To prove necessity, let

$$A = \{a : F_X(a) - F_X(a-) > 0\}.$$

Then, for each $c \notin A$,

$$X_n I\{X_n < c\} \xrightarrow{a.s.} X I\{X < c\}.$$

The family $\{X_n I\{X_n < c\}\}$ is uniformly integrable. Hence, by sufficiency,

$$\lim_{n\to\infty} E\{X_n I\{X_n < c\}\} = E\{X I\{X < c\}\}$$

for $c \notin A$, $n \to \infty$. $A$ has a countable number of jump points. Since $E\{X\} < \infty$, we can choose $c_0 \notin A$ sufficiently large so that, for a given $\epsilon > 0$, $E\{X I\{X \ge c_0\}\} < \dfrac{\epsilon}{2}$. Choose $N_0(\epsilon)$ sufficiently large so that, for $n \ge N_0(\epsilon)$,

$$E\{X_n I\{X_n \ge c_0\}\} \le E\{X I\{X \ge c_0\}\} + \frac{\epsilon}{2}.$$

Choose $c_1 > c_0$ sufficiently large so that $E\{X_n I\{X_n \ge c_1\}\} \le \epsilon$, $n \le N_0$. Then $\sup_n E\{X_n I\{X_n \ge c_1\}\} \le \epsilon$. QED


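Lemma 1.13.1. If $\{X_n\}$ is a sequence of uniformly integrable random variables, then

$$\sup_{n \ge 1} E\{|X_n|\} < \infty. \tag{1.13.11}$$

Proof.

$$\sup_{n \ge 1} E\{|X_n|\} = \sup_{n \ge 1}\left(E\{|X_n|\, I\{|X_n| > c\}\} + E\{|X_n|\, I\{|X_n| \le c\}\}\right)$$

$$\le \sup_{n \ge 1} E\{|X_n|\, I\{|X_n| > c\}\} + \sup_{n \ge 1} E\{|X_n|\, I\{|X_n| \le c\}\}$$

$$\le \epsilon + c,$$

for $0 < c < \infty$ sufficiently large. QED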
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Theorem 1.13.4. A necessary and sufficient condition for a sequence {X,} to be 
uniformly integrable is that

$$\sup_{n \ge 1} E\{|X_n|\} \le B < \infty \tag{1.13.12}$$

and

$$\sup_{n \ge 1} E\{|X_n|\, I_A\} \to 0 \quad \text{when } P\{A\} \to 0. \tag{1.13.13}$$

Proof. (i) Necessity: Condition (1.13.12) was proven in the previous lemma. Furthermore, for any $0 < c < \infty$,

$$E\{|X_n| I_A\} = E\{|X_n|\, I\{A \cap \{|X_n| \ge c\}\}\} + E\{|X_n|\, I\{A \cap \{|X_n| < c\}\}\} \le E\{|X_n|\, I\{|X_n| \ge c\}\} + cP(A). \tag{1.13.14}$$

Choose $c$ sufficiently large, so that $E\{|X_n|\, I\{|X_n| \ge c\}\} < \dfrac{\epsilon}{2}$, and $A$ so that $P\{A\} < \dfrac{\epsilon}{2c}$; then $E\{|X_n| I_A\} < \epsilon$. This proves the necessity of (1.13.13).

(ii) Sufficiency: Let $\epsilon > 0$ be given. Choose $\delta(\epsilon)$ so that $P\{A\} < \delta(\epsilon)$ implies $\sup_{n \ge 1} E\{|X_n| I_A\} \le \epsilon$.

By Chebychev's inequality, for every $c > 0$,

$$P\{|X_n| \ge c\} \le \frac{E\{|X_n|\}}{c}.$$

Hence,

$$\sup_{n \ge 1} P\{|X_n| \ge c\} \le \frac{1}{c}\sup_{n \ge 1} E\{|X_n|\} \le \frac{B}{c}. \tag{1.13.15}$$

The right-hand side of (1.13.15) goes to zero when $c \to \infty$. Choose $c$ sufficiently large so that $P\{|X_n| \ge c\} < \epsilon$. Such a value of $c$ exists, independently of $n$, due to (1.13.15). Let $A = \bigcup_{n=1}^{\infty}\{|X_n| \ge c\}$. For sufficiently large $c$, $P\{A\} < \epsilon$ and, therefore,

$$\sup_{n \ge 1} E\{|X_n|\, I\{|X_n| \ge c\}\} \le E\{|X_n| I_A\} \to 0$$

as $c \to \infty$. This establishes the uniform integrability of $\{X_n\}$. QED

Notice that according to Theorem 1.13.3, if $E|X_n|^r < \infty$, $r \ge 1$, and $X_n \xrightarrow{a.s.} X$, then $\lim_{n\to\infty} E\{X_n^r\} = E\{X^r\}$ if and only if $\{X_n^r\}$ is a uniformly integrable sequence.


1.13.3 Inequalities 


In previous sections we established several inequalities, among them the Chebychev inequality and the Kolmogorov inequality. In this section we establish some useful additional inequalities.

1. The Schwarz Inequality

Let $(X, Y)$ be random variables with joint distribution function $F_{XY}$ and marginal distribution functions $F_X$ and $F_Y$, respectively. Then, for every Borel measurable and integrable functions $g$ and $h$, such that $E\{g^2(X)\} < \infty$ and $E\{h^2(Y)\} < \infty$,

$$\left|\int\!\!\int g(x)h(y)\,dF_{XY}(x, y)\right| \le \left(\int g^2(x)\,dF_X(x)\right)^{1/2}\left(\int h^2(y)\,dF_Y(y)\right)^{1/2}. \tag{1.13.16}$$

To prove (1.13.16), consider the random variable $Q(t) = (g(X) + th(Y))^2$, $-\infty < t < \infty$. Obviously, $Q(t) \ge 0$, for all $t$, $-\infty < t < \infty$. Moreover,

$$E\{Q(t)\} = E\{g^2(X)\} + 2tE\{g(X)h(Y)\} + t^2E\{h^2(Y)\} \ge 0$$

for all $t$. But, $E\{Q(t)\} \ge 0$ for all $t$ if and only if

$$(E\{g(X)h(Y)\})^2 \le E\{g^2(X)\}E\{h^2(Y)\}.$$

This establishes (1.13.16).

2. Jensen's Inequality

A function $g: \mathbb{R} \to \mathbb{R}$ is called convex if, for any $-\infty < x < y < \infty$ and $0 \le \alpha \le 1$,

$$g(\alpha x + (1-\alpha)y) \le \alpha g(x) + (1-\alpha)g(y).$$

Suppose $X$ is a random variable and $E\{|X|\} < \infty$. Then, if $g$ is convex,

$$g(E\{X\}) \le E\{g(X)\}. \tag{1.13.17}$$

To prove (1.13.17), notice that since $g$ is convex, for every $x_0$, $-\infty < x_0 < \infty$, $g(x) \ge g(x_0) + (x - x_0)g^*(x_0)$ for all $x$, $-\infty < x < \infty$, where $g^*(x_0)$ is finite. Substitute $x_0 = E\{X\}$. Then

$$g(X) \ge g(E\{X\}) + (X - E\{X\})g^*(E\{X\})$$

with probability one. Since $E\{X - E\{X\}\} = 0$, we obtain (1.13.17).
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3. Lyapunov’s Inequality 
If $0 < s < r$ and $E\{|X|^r\} < \infty$, then

$$(E\{|X|^s\})^{1/s} \le (E\{|X|^r\})^{1/r}. \tag{1.13.18}$$

To establish this inequality, let $t = r/s$. Notice that $g(x) = |x|^t$ is convex, since $t > 1$. Let $\xi = E\{|X|^s\}$, and notice that $(|X|^s)^t = |X|^r$. Thus, by Jensen's inequality,

$$g(\xi) = (E|X|^s)^{r/s} \le E\{g(|X|^s)\} = E\{|X|^r\}.$$

Hence, $(E\{|X|^s\})^{1/s} \le (E\{|X|^r\})^{1/r}$. As a result of Lyapunov's inequality we have the following chain of inequalities among absolute moments.

$$E\{|X|\} \le (E\{X^2\})^{1/2} \le (E\{|X|^3\})^{1/3} \le \cdots. \tag{1.13.19}$$

4. Hölder's Inequality

Let $1 < p < \infty$ and $1 < q < \infty$, such that $\dfrac{1}{p} + \dfrac{1}{q} = 1$. Suppose that $E\{|X|^p\} < \infty$ and $E\{|Y|^q\} < \infty$. Then

$$E\{|XY|\} \le (E\{|X|^p\})^{1/p}(E\{|Y|^q\})^{1/q}. \tag{1.13.20}$$

Notice that the Schwarz inequality is a special case of Hölder's inequality for $p = q = 2$.
For proof, see Shiryayev (1984, p. 191).

5. Minkowski's Inequality

If $E\{|X|^p\} < \infty$ and $E\{|Y|^p\} < \infty$ for some $1 \le p < \infty$, then $E\{|X + Y|^p\} < \infty$ and

$$(E\{|X + Y|^p\})^{1/p} \le (E\{|X|^p\})^{1/p} + (E\{|Y|^p\})^{1/p}. \tag{1.13.21}$$

For proof, see Shiryayev (1984, p. 192).


1.13.4 The Delta Method 


The delta method is designed to yield large sample approximations to nonlinear functions $g$ of the sample mean $\bar X_n$ and its variance. More specifically, let $\{X_n\}$ be a sequence of i.i.d. random variables. Assume that $0 < V\{X\} < \infty$. By the SLLN, $\bar X_n \xrightarrow{a.s.} \mu$, as $n \to \infty$, where $\bar X_n = \dfrac{1}{n}\sum_{j=1}^{n} X_j$, and by the CLT, $\sqrt{n}\,\dfrac{\bar X_n - \mu}{\sigma} \xrightarrow[n\to\infty]{d} N(0, 1)$. Let $g: \mathbb{R} \to \mathbb{R}$ have a continuous third-order derivative. By the Taylor expansion of $g(\bar X_n)$ around $\mu$,

$$g(\bar X_n) = g(\mu) + (\bar X_n - \mu)g^{(1)}(\mu) + \frac{1}{2}(\bar X_n - \mu)^2 g^{(2)}(\mu) + R_n, \tag{1.13.22}$$

where $R_n = \dfrac{1}{6}(\bar X_n - \mu)^3 g^{(3)}(\mu^*)$, and $\mu^*$ is a point between $\bar X_n$ and $\mu$, i.e., $|\bar X_n - \mu^*| < |\bar X_n - \mu|$. Since we assumed that $g^{(3)}(x)$ is continuous, it is bounded on the closed interval $[\mu - \Delta, \mu + \Delta]$. Moreover, $g^{(3)}(\mu^*) \xrightarrow{a.s.} g^{(3)}(\mu)$, as $n \to \infty$. Thus $R_n \xrightarrow{p} 0$, as $n \to \infty$. The distribution of $g(\mu) + g^{(1)}(\mu)(\bar X_n - \mu)$ is asymptotically $N(g(\mu), (g^{(1)}(\mu))^2\sigma^2/n)$. Moreover, $\sqrt{n}(\bar X_n - \mu)^2 \xrightarrow{p} 0$, as $n \to \infty$. Thus, $\sqrt{n}(g(\bar X_n) - g(\mu)) \xrightarrow{d} N(0, \sigma^2(g^{(1)}(\mu))^2)$. Thus, if $\bar X_n$ satisfies the CLT, an approximation to the expected value of $g(\bar X_n)$ is

$$E\{g(\bar X_n)\} \cong g(\mu) + \frac{\sigma^2}{2n}g^{(2)}(\mu). \tag{1.13.23}$$

An approximation to the variance of $g(\bar X_n)$ is

$$V\{g(\bar X_n)\} \cong \frac{\sigma^2}{n}(g^{(1)}(\mu))^2. \tag{1.13.24}$$
Furthermore, from (1.13.22)

$$\sqrt{n}(g(\bar X_n) - g(\mu)) = \sqrt{n}(\bar X_n - \mu)g^{(1)}(\mu) + D_n, \tag{1.13.25}$$

where

$$D_n = \sqrt{n}\,\frac{(\bar X_n - \mu)^2}{2}\,g^{(2)}(\mu^{**}), \tag{1.13.26}$$

and $|\mu^{**} - \bar X_n| \le |\mu - \bar X_n|$ with probability one. Thus, since $\bar X_n - \mu \to 0$ a.s., as $n \to \infty$, and since $|g^{(2)}(\mu^{**})|$ is bounded, $D_n \xrightarrow{p} 0$, as $n \to \infty$. Then

$$\sqrt{n}\,\frac{g(\bar X_n) - g(\mu)}{\sigma|g^{(1)}(\mu)|} \xrightarrow{d} N(0, 1). \tag{1.13.27}$$

1.13.5 The Symbols $o_p$ and $O_p$

Let $\{X_n\}$ and $\{Y_n\}$ be two sequences of random variables, $Y_n > 0$ a.s. for all $n \ge 1$. We say that $X_n = o_p(Y_n)$, i.e., $X_n$ is of a smaller order of magnitude than $Y_n$ in probability, if

$$\frac{X_n}{Y_n} \xrightarrow{p} 0 \quad \text{as } n \to \infty. \tag{1.13.28}$$

We say that $X_n = O_p(Y_n)$, i.e., $X_n$ has the same order of magnitude in probability as $Y_n$, if, for all $\epsilon > 0$, there exists $K_\epsilon$ such that $\sup_n P\left\{\left|\dfrac{X_n}{Y_n}\right| > K_\epsilon\right\} < \epsilon$.

One can verify the following relations.

(i) $o_p(1) + O_p(1) = O_p(1)$,
(ii) $O_p(1) + O_p(1) = O_p(1)$,
(iii) $o_p(1) + o_p(1) = o_p(1)$, (1.13.29)
(iv) $O_p(1) \cdot O_p(1) = O_p(1)$,
(v) $o_p(1) \cdot O_p(1) = o_p(1)$.


1.13.6 The Empirical Distribution and Sample Quantiles 


Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a distribution $F$. The function

$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I\{X_i \le x\} \tag{1.13.30}$$

is called the empirical distribution function (EDF).
Notice that $E\{I\{X_i \le x\}\} = F(x)$. Thus, the SLLN implies that at each $x$, $F_n(x) \xrightarrow{a.s.} F(x)$ as $n \to \infty$. The question is whether this convergence is uniform in $x$. The answer is given by

Theorem 1.13.5 (Glivenko-Cantelli). Let $X_1, X_2, X_3, \ldots$ be i.i.d. random variables. Then

$$\sup_{-\infty < x < \infty}|F_n(x) - F(x)| \xrightarrow{a.s.} 0, \quad \text{as } n \to \infty. \tag{1.13.31}$$

For proof, see Sen and Singer (1993, p. 185).

The $p$th sample quantile $x_{n,p}$ is defined as

$$x_{n,p} = F_n^{-1}(p) = \inf\{x : F_n(x) \ge p\} \tag{1.13.32}$$

for $0 < p < 1$, where $F_n(x)$ is the EDF. When $F(x)$ is continuous, the points of increase of $F_n(x)$ are the order statistics $X_{(1:n)} < \cdots < X_{(n:n)}$ with probability one. Also, $F_n(X_{(i:n)}) = \dfrac{i}{n}$, $i = 1, \ldots, n$. Thus,

$$x_{n,p} = X_{(i(p):n)}, \quad \text{where } i(p) = \text{smallest integer } i \text{ such that } i \ge pn. \tag{1.13.33}$$

Theorem 1.13.6. Let $F$ be a continuous distribution function, and $\xi_p = F^{-1}(p)$, and suppose that $F(\xi_p) = p$ and for any $\epsilon > 0$, $F(\xi_p - \epsilon) < p < F(\xi_p + \epsilon)$. Let $X_1, \ldots, X_n$ be i.i.d. random variables from this distribution. Then

$$x_{n,p} \xrightarrow{a.s.} \xi_p \quad \text{as } n \to \infty.$$

For proof, see Sen and Singer (1993, p. 167).
The following theorem establishes the asymptotic normality of $x_{n,p}$.

Theorem 1.13.7. Let $F(x)$ be an absolutely continuous distribution, with continuous p.d.f. $f(x)$. Let $p$, $0 < p < 1$, $\xi_p = F^{-1}(p)$ and $f(\xi_p) > 0$. Then

$$\sqrt{n}(x_{n,p} - \xi_p) \xrightarrow{d} N\left(0, \frac{p(1-p)}{f^2(\xi_p)}\right). \tag{1.13.34}$$

For proof, see Sen and Singer (1993, p. 168).
The results of Theorems 1.13.6-1.13.7 will be used in Chapter 7 to establish the asymptotic relative efficiency of the sample median, relative to the sample mean.
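The definitions of the EDF and of the $p$th sample quantile translate directly into a few lines of code. The sketch below is only an illustration: the function names `edf` and `sample_quantile` and the small data set are made up here, and the quantile is computed as the order statistic with index $i(p)$, the smallest integer $i \ge pn$.

```python
import math

def edf(sample, x):
    """Empirical distribution function F_n(x) = (1/n) * #{i : X_i <= x}."""
    return sum(1 for v in sample if v <= x) / len(sample)

def sample_quantile(sample, p):
    """p-th sample quantile X_(i(p):n), with i(p) the smallest integer >= p*n."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(sample)
    i = max(1, math.ceil(p * len(ordered)))   # 1-based index i(p)
    return ordered[i - 1]

data = [2.7, 0.4, 1.9, 3.3, 0.8, 2.1, 1.2, 4.0]   # hypothetical observations
print(edf(data, 2.0))                # fraction of observations <= 2.0
print(sample_quantile(data, 0.5))    # sample median under this definition
```

With this convention the sample median of an even-sized sample is the lower of the two middle order statistics, not their average.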
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Example 1.1. We illustrate here two algebras. 
The sample space is finite

$$S = \{1, 2, \ldots, 10\}.$$

Let $E_1 = \{1, 2\}$, $E_2 = \{9, 10\}$. The algebra generated by $E_1$ and $E_2$, $\mathcal{A}_1$, contains the events

$$\mathcal{A}_1 = \{S, \emptyset, E_1, \bar E_1, E_2, \bar E_2, E_1 \cup E_2, \overline{E_1 \cup E_2}\}.$$

The algebra generated by the partition $\mathcal{D} = \{E_1, E_2, E_3, E_4\}$, where $E_1 = \{1, 2\}$, $E_2 = \{9, 10\}$, $E_3 = \{3, 4, 5\}$, $E_4 = \{6, 7, 8\}$, contains the $2^4 = 16$ events

$$\mathcal{A}_2 = \{S, \emptyset, E_1, E_2, E_3, E_4, E_1 \cup E_2, E_1 \cup E_3, E_1 \cup E_4, E_2 \cup E_3, E_2 \cup E_4, E_3 \cup E_4, E_1 \cup E_2 \cup E_3, E_1 \cup E_2 \cup E_4, E_1 \cup E_3 \cup E_4, E_2 \cup E_3 \cup E_4\}.$$

Notice that the complement of each set in $\mathcal{A}_2$ is in $\mathcal{A}_2$, and that $\mathcal{A}_1 \subset \mathcal{A}_2$. Also, $\mathcal{A}_2 \subset \mathcal{A}(S)$.
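The two algebras of this example can be generated mechanically by closing the generating sets under complements and unions. The following sketch does this by brute force; the helper name `generated_algebra` is introduced here for illustration and is practical only for small finite sample spaces.

```python
from itertools import combinations

def generated_algebra(sample_space, generators):
    """Smallest family containing the generators that is closed under
    complement and union (and therefore under intersection as well)."""
    space = frozenset(sample_space)
    family = {frozenset(), space} | {frozenset(g) for g in generators}
    changed = True
    while changed:                      # repeat until nothing new appears
        changed = False
        current = list(family)
        for a in current:
            if space - a not in family:
                family.add(space - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in family:
                family.add(a | b)
                changed = True
    return family

S = range(1, 11)
A1 = generated_algebra(S, [{1, 2}, {9, 10}])
A2 = generated_algebra(S, [{1, 2}, {9, 10}, {3, 4, 5}, {6, 7, 8}])
print(len(A1), len(A2), A1 <= A2)
```

The comparison `A1 <= A2` checks the inclusion of the first algebra in the second.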


Example 1.2. In this example we consider a random walk on the integers. Consider an experiment in which a particle is initially at the origin, 0. In the first trial the particle moves to $+1$ or to $-1$. In the second trial it moves either one integer to the right or one integer to the left. The experiment consists of $2n$ such trials ($1 \le n < \infty$). The sample space $S$ is finite and there are $2^{2n}$ points in $S$, i.e., $S = \{(i_1, \ldots, i_{2n}) : i_j = \pm 1, j = 1, \ldots, 2n\}$. Let

$$E_j = \left\{(i_1, \ldots, i_{2n}) : \sum_{k=1}^{2n} i_k = j\right\}, \quad j = 0, \pm 1, \ldots, \pm 2n.$$

$E_j$ is the event that, at the end of the experiment, the particle is at the integer $j$. Obviously, $-2n \le j \le 2n$. It is simple to show that $j$ must be an even integer $j = \pm 2k$, $k = 0, 1, \ldots, n$. Thus, $\mathcal{D} = \{E_{2k}, k = 0, \pm 1, \ldots, \pm n\}$ is a partition of $S$. The event $E_{2k}$ consists of all elementary events in which there are $(n + k)$ $+1$s and $(n - k)$ $-1$s. Thus, $E_{2k}$ is the union of $\binom{2n}{n+k}$ points of $S$, $k = 0, \pm 1, \ldots, \pm n$.

The algebra generated by $\mathcal{D}$, $\mathcal{A}(\mathcal{D})$, consists of $\emptyset$ and $2^{2n+1} - 1$ unions of the elements of $\mathcal{D}$.
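For a small number of steps the counts in this example can be confirmed by enumerating all paths. The sketch below uses $n = 3$ (so $2n = 6$ steps), a value chosen here only to keep the enumeration short.

```python
from collections import Counter
from itertools import product
from math import comb

n = 3                                     # the walk has 2n = 6 steps
paths = list(product((-1, 1), repeat=2 * n))
final = Counter(sum(p) for p in paths)    # end position -> number of paths

for k in range(-n, n + 1):
    # number of paths ending at 2k, next to the binomial coefficient C(2n, n+k)
    print(2 * k, final[2 * k], comb(2 * n, n + k))
```

Only even end positions occur, and the number of paths ending at $2k$ agrees with the binomial coefficient with upper index $2n$ and lower index $n + k$.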


Example 1.3. Let $S$ be the real line, i.e., $S = \{x : -\infty < x < \infty\}$. We construct an algebra $\mathcal{A}$ generated by half-closed intervals: $E_x = (-\infty, x]$, $-\infty < x < \infty$. Notice that, for $x < y$, $E_x \cup E_y = (-\infty, y]$. The complement of $E_x$ is $\bar E_x = (x, \infty)$. We will adopt the convention that $(x, \infty) \equiv (x, \infty]$.

Consider the sequence of intervals $E_n = \left(-\infty, 1 - \dfrac{1}{n}\right]$, $n \ge 1$. All $E_n \in \mathcal{A}$. However, $\bigcup_{n=1}^{\infty} E_n = (-\infty, 1)$. Thus $\lim_{n\to\infty} E_n$ does not belong to $\mathcal{A}$, so $\mathcal{A}$ is not a $\sigma$-field. In order to make $\mathcal{A}$ into a $\sigma$-field we have to add to it all limit sets of sequences of events in $\mathcal{A}$.


Example 1.4. We illustrate here three events that are only pairwise independent.

Let $S = \{1, 2, 3, 4\}$, with $P(w) = \dfrac{1}{4}$ for all $w \in S$. Define the three events

$$A_1 = \{1, 2\}, \quad A_2 = \{1, 3\}, \quad A_3 = \{1, 4\}.$$

Then $P\{A_i\} = \dfrac{1}{2}$, $i = 1, 2, 3$, and

$$A_1 \cap A_2 = \{1\}, \quad A_1 \cap A_3 = \{1\}, \quad A_2 \cap A_3 = \{1\}.$$

Thus

$$P\{A_1 \cap A_2\} = \frac{1}{4} = P\{A_1\}P\{A_2\},$$

$$P\{A_1 \cap A_3\} = \frac{1}{4} = P\{A_1\}P\{A_3\},$$

$$P\{A_2 \cap A_3\} = \frac{1}{4} = P\{A_2\}P\{A_3\}.$$

Thus, $A_1, A_2, A_3$ are pairwise independent. On the other hand,

$$A_1 \cap A_2 \cap A_3 = \{1\}$$

and

$$P\{A_1 \cap A_2 \cap A_3\} = \frac{1}{4} \ne P\{A_1\}P\{A_2\}P\{A_3\} = \frac{1}{8}.$$

Thus, the triplet $(A_1, A_2, A_3)$ is not independent.
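The probabilities in this example are simple enough to check by exact arithmetic. The sketch below does so with rational numbers; the dictionary keys and the helper `prob` are names introduced here for illustration.

```python
from fractions import Fraction
from itertools import combinations

S = {1, 2, 3, 4}
events = {"A1": {1, 2}, "A2": {1, 3}, "A3": {1, 4}}

def prob(event):
    """Probability of an event when the four points of S are equally likely."""
    return Fraction(len(event & S), len(S))

# Product rule for every pair of events.
pairwise = all(
    prob(events[a] & events[b]) == prob(events[a]) * prob(events[b])
    for a, b in combinations(events, 2)
)

# Product rule for the three events jointly.
triple = events["A1"] & events["A2"] & events["A3"]
product_of_three = prob(events["A1"]) * prob(events["A2"]) * prob(events["A3"])
mutual = prob(triple) == product_of_three

print(pairwise, mutual, prob(triple), product_of_three)
```

The product rule holds for each pair but fails for the triple intersection.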


Example 1.5. An infinite sequence of trials, in which each trial results in either "success" $S$ or "failure" $F$, is called Bernoulli trials if all trials are independent and the probability of success in each trial is the same. More specifically, consider the sample space of countable sequences of $S$'s and $F$'s, i.e.,

$$\mathcal{S} = \{(i_1, i_2, \ldots) : i_j = S, F, \; j = 1, 2, \ldots\}.$$

Let

$$E_j = \{(i_1, i_2, \ldots) : i_j = S\}, \quad j = 1, 2, \ldots.$$

We assume that $\{E_1, E_2, \ldots, E_n\}$ are mutually independent for all $n \ge 2$ and $P\{E_j\} = p$ for all $j = 1, 2, \ldots$, $0 < p < 1$.

The points of $\mathcal{S}$ represent an infinite sequence of Bernoulli trials. Consider the events

$$A_j = \{(i_1, i_2, \ldots) : i_j = S, i_{j+1} = F, i_{j+2} = S\} = E_j \cap \bar E_{j+1} \cap E_{j+2},$$

$j = 1, 2, \ldots$. The events $\{A_j\}$ are not independent, since consecutive events share trials.
Let $B_j = A_{3j+1}$, $j \ge 0$. The sequence $\{B_j, j \ge 0\}$ consists of mutually independent events, because the $B_j$ depend on disjoint triplets of trials. Moreover, $P(B_j) = p^2(1 - p)$ for all $j = 0, 1, 2, \ldots$. Thus, $\sum_{j=0}^{\infty} P(B_j) = \infty$ and the Borel-Cantelli Lemma implies that $P\{B_n, \text{ i.o.}\} = 1$. That is, the pattern $SFS$ will occur infinitely many times in a sequence of Bernoulli trials, with probability one.


Example 1.6. Let $S$ be the sample space of $N = 2^n$ binary sequences of size $n$, $n < \infty$, i.e.,

$$S = \{(i_1, \ldots, i_n) : i_j = 0, 1, \; j = 1, \ldots, n\}.$$

We assign the points $w = (i_1, \ldots, i_n)$ of $S$ equal probabilities, i.e., $P\{(i_1, \ldots, i_n)\} = 2^{-n}$. Consider the partition $\mathcal{D} = \{B_0, B_1, \ldots, B_n\}$ to $k = n + 1$ disjoint events, such that

$$B_j = \left\{(i_1, \ldots, i_n) : \sum_{l=1}^{n} i_l = j\right\}, \quad j = 0, \ldots, n.$$

$B_j$ is the set of all points having exactly $j$ ones and $(n - j)$ zeros. We define the discrete random variable corresponding to $\mathcal{D}$ as

$$X(w) = \sum_{j=0}^{n} j\, I_{B_j}(w).$$

The jump points of $X(w)$ are $\{0, 1, \ldots, n\}$. The probability distribution function of $X(w)$ is

$$f_X(x) = \sum_{j=0}^{n} I_{\{j\}}(x)P\{B_j\}.$$

It is easy to verify that

$$P\{B_j\} = \binom{n}{j}2^{-n}, \quad j = 0, 1, \ldots, n,$$

where

$$\binom{n}{j} = \frac{n!}{j!(n - j)!}, \quad j = 0, 1, \ldots, n.$$

Thus,

$$f_X(x) = \sum_{j=0}^{n} I_{\{j\}}(x)\binom{n}{j}2^{-n}.$$

The distribution function (c.d.f.) is given by

$$F_X(x) = \begin{cases} 0, & \text{if } x < 0 \\ \displaystyle\sum_{j=0}^{[x]}\binom{n}{j}2^{-n}, & \text{if } 0 \le x < n \\ 1, & \text{if } n \le x, \end{cases}$$

where $[x]$ is the maximal integer value smaller or equal to $x$. The distribution function illustrated here is called a binomial distribution (see Section 2.2.1).

Example 1.7. Consider the random variable of Example 1.6. In that example $X(w) \in \{0, 1, \ldots, n\}$ and $f_X(j) = \binom{n}{j}2^{-n}$, $j = 0, \ldots, n$. Accordingly,


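Example 1.8. Let $(S, \mathcal{F}, P)$ be a probability space where $S = \{0, 1, 2, \ldots\}$. $\mathcal{F}$ is the $\sigma$-field of all subsets of $S$. Consider $X(w) = w$, with probability function

$$p_j = P\{w : X(w) = j\} = e^{-\lambda}\frac{\lambda^j}{j!}, \quad j = 0, 1, 2, \ldots$$

for some $\lambda$, $0 < \lambda < \infty$. Here $0 < p_j < \infty$ for all $j$, and since $\sum_{j=0}^{\infty}\dfrac{\lambda^j}{j!} = e^{\lambda}$, $\sum_{j=0}^{\infty} p_j = 1$.

Consider the partition $\mathcal{D} = \{A_1, A_2, A_3\}$ where $A_1 = \{w : 0 \le w \le 10\}$, $A_2 = \{w : 10 < w \le 20\}$ and $A_3 = \{w : w \ge 21\}$. The probabilities of these sets are

$$q_1 = P\{A_1\} = e^{-\lambda}\sum_{j=0}^{10}\frac{\lambda^j}{j!},$$

$$q_2 = P\{A_2\} = e^{-\lambda}\sum_{j=11}^{20}\frac{\lambda^j}{j!}, \quad \text{and}$$

$$q_3 = P\{A_3\} = e^{-\lambda}\sum_{j=21}^{\infty}\frac{\lambda^j}{j!}.$$

The conditional distributions of $X$ given $A_i$, $i = 1, 2, 3$, are

$$f_{X|A_i}(x) = \frac{\dfrac{\lambda^x}{x!}I_{A_i}(x)}{\displaystyle\sum_{j=b_{i-1}}^{b_i - 1}\frac{\lambda^j}{j!}}, \quad i = 1, 2, 3,$$

where $b_0 = 0$, $b_1 = 11$, $b_2 = 21$, $b_3 = \infty$.
The conditional expectations are

$$E\{X \mid A_i\} = \lambda\,\frac{\displaystyle\sum_{j=(b_{i-1}-1)^+}^{b_i - 2}\frac{\lambda^j}{j!}}{\displaystyle\sum_{j=b_{i-1}}^{b_i - 1}\frac{\lambda^j}{j!}}, \quad i = 1, 2, 3,$$

where $a^+ = \max(a, 0)$. $E\{X \mid \mathcal{D}\}$ is a random variable, which obtains the values $E\{X \mid A_1\}$ with probability $q_1$, $E\{X \mid A_2\}$ with probability $q_2$, and $E\{X \mid A_3\}$ with probability $q_3$.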
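The cell probabilities and conditional expectations of this example are easy to evaluate numerically. The sketch below does so for a hypothetical value $\lambda = 12$ and truncates the infinite cell at 200, an assumption that is harmless for this $\lambda$ because the Poisson tail beyond that point is negligible. It also checks that averaging the conditional expectations with weights $q_i$ recovers $E\{X\} = \lambda$.

```python
import math

lam = 12.0            # hypothetical value of lambda, chosen for illustration
cuts = [0, 11, 21]    # b_0, b_1, b_2; the last cell is {21, 22, ...}
upper = 200           # truncation point for the infinite cell (assumption)

def pois(j):
    """Poisson probability, computed on the log scale for numerical stability."""
    return math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))

cells = [range(cuts[0], cuts[1]), range(cuts[1], cuts[2]), range(cuts[2], upper)]
q = [sum(pois(j) for j in cell) for cell in cells]
cond_mean = [sum(j * pois(j) for j in cell) / qi for cell, qi in zip(cells, q)]

print([round(v, 4) for v in q])
print([round(v, 4) for v in cond_mean])
print(sum(qi * m for qi, m in zip(q, cond_mean)))   # should be close to lam
```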


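Example 1.9. Consider two discrete random variables $X$, $Y$ on $(S, \mathcal{F}, P)$ such that the jump points of $X$ and $Y$ are the nonnegative integers $\{0, 1, 2, \ldots\}$. The joint probability function of $(X, Y)$ is

$$f_{XY}(x, y) = \begin{cases} e^{-\lambda}\dfrac{\lambda^y}{(y + 1)!}, & x = 0, 1, \ldots, y; \; y = 0, 1, 2, \ldots \\ 0, & \text{otherwise}, \end{cases}$$

where $\lambda$, $0 < \lambda < \infty$, is a specified parameter.
First, we have to check that

$$\sum_{x=0}^{\infty}\sum_{y=0}^{\infty} f_{XY}(x, y) = 1.$$

Indeed,

$$f_Y(y) = \sum_{x=0}^{y} f_{XY}(x, y) = e^{-\lambda}\frac{\lambda^y}{y!}, \quad y = 0, 1, \ldots$$

and

$$\sum_{y=0}^{\infty} e^{-\lambda}\frac{\lambda^y}{y!} = e^{-\lambda}\cdot e^{\lambda} = 1.$$

The conditional p.d.f. of $X$ given $\{Y = y\}$, $y = 0, 1, \ldots$ is

$$f_{X|Y}(x \mid y) = \begin{cases} \dfrac{1}{1 + y}, & x = 0, 1, \ldots, y \\ 0, & \text{otherwise}. \end{cases}$$

Hence,

$$E\{X \mid Y = y\} = \frac{1}{1 + y}\sum_{x=0}^{y} x = \frac{y}{2}, \quad y = 0, 1, \ldots$$

and, as a random variable,

$$E\{X \mid Y\} = \frac{Y}{2}.$$

Finally,

$$E\{E\{X \mid Y\}\} = \sum_{y=0}^{\infty}\frac{y}{2}\,e^{-\lambda}\frac{\lambda^y}{y!} = \frac{\lambda}{2}.$$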


Example 1.10. In this example we show an absolutely continuous distribution for which $E\{X\}$ does not exist.

Let $F(x) = \dfrac{1}{2} + \dfrac{1}{\pi}\tan^{-1}(x)$. This is called the Cauchy distribution. The density function (p.d.f.) is

$$f(x) = \frac{1}{\pi}\cdot\frac{1}{1 + x^2}, \quad -\infty < x < \infty.$$

It is a symmetric density around $x = 0$, in the sense that $f(x) = f(-x)$ for all $x$. The expected value of $X$ having this distribution does not exist. Indeed,

$$\int_{-\infty}^{\infty}|x| f(x)\,dx = \frac{2}{\pi}\int_{0}^{\infty}\frac{x}{1 + x^2}\,dx = \frac{1}{\pi}\lim_{T\to\infty}\log(1 + T^2) = \infty.$$


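Example 1.11. We show here a mixture of discrete and absolutely continuous distributions.
Let

$$F_{ac}(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1 - \exp\{-\lambda x\}, & \text{if } x \ge 0 \end{cases}$$

$$F_{d}(x) = \begin{cases} 0, & \text{if } x < 0 \\ e^{-\mu}\displaystyle\sum_{j=0}^{[x]}\frac{\mu^j}{j!}, & \text{if } x \ge 0 \end{cases}$$

where $[x]$ designates the maximal integer not exceeding $x$; $\lambda$ and $\mu$ are real positive numbers. The mixed distribution is, for $0 \le \alpha \le 1$,

$$F(x) = \begin{cases} 0, & \text{if } x < 0 \\ \alpha e^{-\mu}\displaystyle\sum_{j=0}^{[x]}\frac{\mu^j}{j!} + (1 - \alpha)[1 - \exp(-\lambda x)], & \text{if } x \ge 0. \end{cases}$$

This distribution function can be applied with appropriate values of $\alpha$, $\lambda$, and $\mu$ for modeling the length of telephone conversations. It has discontinuities at the nonnegative integers and is continuous elsewhere.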


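Example 1.12. Densities derived after transformations.
Let $X$ be a random variable having an absolutely continuous distribution with p.d.f. $f_X$.

A. If $Y = X^2$, the number of roots is

$$m(y) = \begin{cases} 0, & \text{if } y < 0 \\ 1, & \text{if } y = 0 \\ 2, & \text{if } y > 0. \end{cases}$$

Thus, the density of $Y$ is

$$f_Y(y) = \begin{cases} 0, & y \le 0 \\ \dfrac{1}{2\sqrt{y}}\left[f_X(\sqrt{y}) + f_X(-\sqrt{y})\right], & y > 0. \end{cases}$$

B. If $Y = \cos X$,

$$m(y) = \begin{cases} 0, & \text{if } |y| > 1 \\ \infty, & \text{if } |y| \le 1. \end{cases}$$

For every $y$, such that $|y| < 1$, let $\xi(y)$ be the value of $\cos^{-1}(y)$ in the interval $(0, \pi)$. The roots of $\cos x = y$ are the points $\pm\xi(y) + 2\pi j$, $j = 0, \pm 1, \pm 2, \ldots$. Then, if $f_X(x)$ is the p.d.f. of $X$, the p.d.f. of $Y = \cos X$ is, for $|y| < 1$,

$$f_Y(y) = \frac{1}{\sqrt{1 - y^2}}\left\{f_X(\xi(y)) + f_X(-\xi(y)) + \sum_{j=1}^{\infty}\left[f_X(\xi(y) + 2\pi j) + f_X(\xi(y) - 2\pi j) + f_X(-\xi(y) + 2\pi j) + f_X(-\xi(y) - 2\pi j)\right]\right\}.$$

The density does not exist for $|y| \ge 1$.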


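Example 1.13. Three cases of joint p.d.f.

A. Both $X_1$, $X_2$ are discrete, with jump points on $\{0, 1, 2, \ldots\}$. Their joint p.d.f. for $0 < \lambda < \infty$ is

$$f_{X_1X_2}(x_1, x_2) = \binom{x_2}{x_1}2^{-x_2}\,e^{-\lambda}\frac{\lambda^{x_2}}{x_2!}$$

for $x_1 = 0, \ldots, x_2$, $x_2 = 0, 1, \ldots$. The marginal p.d.f. are

$$f_{X_1}(x_1) = e^{-\lambda/2}\frac{(\lambda/2)^{x_1}}{x_1!}, \quad x_1 = 0, 1, \ldots \quad \text{and}$$

$$f_{X_2}(x_2) = e^{-\lambda}\frac{\lambda^{x_2}}{x_2!}, \quad x_2 = 0, 1, \ldots.$$

B. Both $X_1$ and $X_2$ are absolutely continuous, with joint p.d.f.

$$f_{X_1X_2}(x, y) = 2 I_{(0,1)}(x) I_{(0,x)}(y).$$

The marginal distributions of $X_1$ and $X_2$ are

$$f_{X_1}(x) = 2x\, I_{(0,1)}(x) \quad \text{and}$$

$$f_{X_2}(y) = 2(1 - y) I_{(0,1)}(y).$$

C. $X_1$ is discrete with jump points $\{0, 1, 2, \ldots\}$ and $X_2$ absolutely continuous. The joint p.d.f., with respect to the $\sigma$-finite measure $dN(x)\,dy$ is, for $0 < \lambda < \infty$,

$$f_{X_1X_2}(x, y) = e^{-\lambda}\frac{\lambda^x}{x!}\cdot\frac{1}{1 + x}\,I\{x = 0, 1, \ldots\}\,I_{(0,1+x)}(y).$$

The marginal p.d.f. of $X_1$ is

$$f_{X_1}(x) = e^{-\lambda}\frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots.$$

The marginal p.d.f. of $X_2$ is

$$f_{X_2}(y) = \frac{1}{\lambda}\sum_{n=0}^{\infty}\left(1 - e^{-\lambda}\sum_{j=0}^{n}\frac{\lambda^j}{j!}\right)I_{[n,n+1)}(y).$$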


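Example 1.14. Suppose that $X$, $Y$ are positive random variables, having a joint p.d.f.

$$f_{XY}(x, y) = \lambda\,\frac{1}{y}\,e^{-\lambda y}I_{(0,y)}(x), \quad 0 < y < \infty, \; 0 < x < y, \; 0 < \lambda < \infty.$$

The marginal p.d.f. of $X$ is

$$f_X(x) = \lambda\int_{x}^{\infty}\frac{1}{y}\,e^{-\lambda y}\,dy = \lambda E_1(\lambda x),$$

where $E_1(\xi) = \displaystyle\int_{\xi}^{\infty}\frac{1}{u}\,e^{-u}\,du$ is called the exponential integral, which is finite for all $\xi > 0$. Thus, according to (1.6.62), for $x_0 > 0$,

$$f_{Y|X}(y \mid x_0) = \frac{\dfrac{1}{y}\,e^{-\lambda y}I_{(x_0,\infty)}(y)}{E_1(\lambda x_0)}.$$

Finally, for $x_0 > 0$,

$$E\{Y \mid X = x_0\} = \frac{\displaystyle\int_{x_0}^{\infty} e^{-\lambda y}\,dy}{E_1(\lambda x_0)} = \frac{e^{-\lambda x_0}}{\lambda E_1(\lambda x_0)}.$$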


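Example 1.15. In this example we show a distribution function whose m.g.f., $M$, exists only on an interval $(-\infty, t_0)$. Let

$$F(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1 - e^{-\lambda x}, & \text{if } x \ge 0, \end{cases}$$

where $0 < \lambda < \infty$. The m.g.f. is

$$M(t) = \lambda\int_{0}^{\infty} e^{tx - \lambda x}\,dx = \frac{\lambda}{\lambda - t} = \left(1 - \frac{t}{\lambda}\right)^{-1}, \quad -\infty < t < \lambda.$$

The integral in $M(t)$ is $\infty$ if $t \ge \lambda$. Thus, the domain of convergence of $M$ is $(-\infty, \lambda)$.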

Example 1.16. Let

$$X_i = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } (1 - p) \end{cases}$$

$i = 1, \ldots, n$. We assume also that $X_1, \ldots, X_n$ are independent. We wish to derive the p.d.f. of $S_n = \sum_{i=1}^{n} X_i$. The p.g.f. of $S_n$ is, due to independence, when $q = 1 - p$,

$$E\{t^{S_n}\} = E\left\{t^{\sum_{i=1}^{n} X_i}\right\} = \prod_{i=1}^{n} E\{t^{X_i}\} = (pt + q)^n, \quad -\infty < t < \infty,$$

since all $X_i$ have the same distribution. Binomial expansion yields

$$E\{t^{S_n}\} = \sum_{j=0}^{n}\binom{n}{j}p^j q^{n-j}t^j.$$

Since two polynomials of degree $n$ are equal for all $t$ only if their coefficients are equal, we obtain

$$P\{S_n = j\} = \binom{n}{j}p^j(1 - p)^{n-j}, \quad j = 0, \ldots, n.$$

The distribution of $S_n$ is called the binomial distribution.
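The coefficient-matching argument can be mirrored numerically: multiplying out the $n$ linear factors $(q + pt)$ gives the coefficients of the p.g.f., and these should agree with the binomial probabilities. The sketch below does this for hypothetical values $n = 8$ and $p = 0.3$; the function name `pgf_coefficients` is introduced here for illustration.

```python
from math import comb

def pgf_coefficients(n, p):
    """Coefficients of (q + p*t)^n, obtained by multiplying n linear factors."""
    q = 1.0 - p
    coeffs = [1.0]                       # start from the constant polynomial 1
    for _ in range(n):
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += q * c              # the factor contributes q * t^0
            nxt[j + 1] += p * c          # the factor contributes p * t^1
        coeffs = nxt
    return coeffs

n, p = 8, 0.3
coeffs = pgf_coefficients(n, p)
binom = [comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(n + 1)]
print(max(abs(a - b) for a, b in zip(coeffs, binom)))   # largest discrepancy
```

The coefficient of $t^j$ is the probability that the sum equals $j$, so the two lists coincide up to rounding error.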


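Example 1.17. In Example 1.13 Part C, the conditional p.d.f. of $X_2$ given $\{X_1 = x\}$ is

$$f_{X_2|X_1}(y \mid x) = \frac{1}{1 + x}\,I_{(0,1+x)}(y).$$

This is called the uniform distribution on $(0, 1 + x)$. Writing $X = X_1$ and $Y = X_2$, it is easy to find that

$$E\{Y \mid X = x\} = \frac{1 + x}{2}$$

and

$$V\{Y \mid X = x\} = \frac{(1 + x)^2}{12}.$$

Since the p.d.f. of $X$ is

$$P\{X = x\} = e^{-\lambda}\frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots,$$

the law of iterated expectation yields

$$E\{Y\} = E\{E\{Y \mid X\}\} = E\left\{\frac{1}{2} + \frac{X}{2}\right\} = \frac{1}{2} + \frac{\lambda}{2},$$

since $E\{X\} = \lambda$.
The law of total variance yields

$$V\{Y\} = V\{E\{Y \mid X\}\} + E\{V\{Y \mid X\}\}$$

$$= V\left\{\frac{1}{2} + \frac{X}{2}\right\} + E\left\{\frac{(1 + X)^2}{12}\right\}$$

$$= \frac{\lambda}{4} + \frac{1}{12}(1 + 3\lambda + \lambda^2)$$

$$= \frac{(1 + \lambda)^2}{12} + \frac{\lambda}{3}.$$

To verify these results, prove that $E\{X\} = \lambda$, $V\{X\} = \lambda$ and $E\{X^2\} = \lambda(1 + \lambda)$. We also used the result that $V\{a + bX\} = b^2V\{X\}$.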
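The mean and variance of $Y$ can also be obtained directly, by summing the conditional first and second moments of the uniform distribution on $(0, 1 + x)$ against the Poisson probabilities. The sketch below does this for a hypothetical value $\lambda = 2.5$, truncating the Poisson support at 150 (an assumption with negligible effect here), and compares the results with the closed forms $(1 + \lambda)/2$ and $(1 + \lambda)^2/12 + \lambda/3$ that follow from the iterated expectation and total variance laws.

```python
import math

lam = 2.5        # hypothetical value of lambda
upper = 150      # truncation of the Poisson support (assumption)

def pois(x):
    """Poisson probability, computed on the log scale."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

# Y | X = x is uniform on (0, 1 + x): mean (1+x)/2, second moment (1+x)^2/3.
mean_y = sum(pois(x) * (1 + x) / 2 for x in range(upper))
second = sum(pois(x) * (1 + x) ** 2 / 3 for x in range(upper))
var_y = second - mean_y ** 2

print(mean_y, (1 + lam) / 2)
print(var_y, (1 + lam) ** 2 / 12 + lam / 3)
```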


Example 1.18. Let $X_1, X_2, X_3$ be uncorrelated random variables, having the same variance $\sigma^2$, i.e.,

$$\Sigma = \sigma^2 I.$$

Consider the linear transformations

$$Y_1 = X_1 + X_2,$$

$$Y_2 = X_1 + X_3,$$

and

$$Y_3 = X_2 + X_3.$$

In matrix notation

$$\mathbf{Y} = A\mathbf{X},$$

where

$$A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}.$$

The variance-covariance matrix of $\mathbf{Y}$, according to (1.8.30), is

$$V[\mathbf{Y}] = A\Sigma A' = \sigma^2 AA' = \sigma^2\begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$

From this we obtain that the correlations of $Y_i$, $Y_j$ for $i \ne j$ are $\rho_{ij} = \dfrac{1}{2}$.
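The matrix product in this example is small enough to compute by hand, but a few lines of code make the check explicit. The sketch below takes $A$ to be the coefficient matrix read off from the definitions of $Y_1$, $Y_2$, $Y_3$ and works with $\sigma^2 = 1$; the helper names are illustrative.

```python
A = [[1, 1, 0],    # Y1 = X1 + X2
     [1, 0, 1],    # Y2 = X1 + X3
     [0, 1, 1]]    # Y3 = X2 + X3

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

AAt = matmul(A, transpose(A))    # equals V[Y] / sigma^2
corr = [[AAt[i][j] / (AAt[i][i] * AAt[j][j]) ** 0.5 for j in range(3)]
        for i in range(3)]
print(AAt)
print(corr)
```

The common factor $\sigma^2$ cancels in the correlations, so they do not depend on its value.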


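Example 1.19. We illustrate here convergence in distribution.

A. Let $X_1, X_2, \ldots$ be random variables with distribution functions

$$F_n(x) = \begin{cases} 0, & \text{if } x < 0 \\ \dfrac{1}{n} + \left(1 - \dfrac{1}{n}\right)(1 - e^{-x}), & \text{if } x \ge 0. \end{cases}$$

$X_n \xrightarrow{d} X$, where the distribution of $X$ is

$$F(x) = \begin{cases} 0, & x < 0 \\ 1 - e^{-x}, & x \ge 0. \end{cases}$$

B. $X_n$ are random variables with

$$F_n(x) = \begin{cases} 0, & x < 0 \\ 1 - e^{-nx}, & x \ge 0 \end{cases}$$

and $F(x) = I\{x \ge 0\}$. $X_n \xrightarrow{d} X$. Notice that $F(x)$ is discontinuous at $x = 0$. But, for all $x \ne 0$, $\lim_{n\to\infty} F_n(x) = F(x)$.

C. $\mathbf{X}_n$ are random vectors, i.e.,

$$\mathbf{X}_n = (X_{1n}, X_{2n}), \quad n \ge 1.$$

The function $I_x(a, b)$, for $0 < a, b < \infty$, $0 \le x \le 1$, is called the incomplete beta function ratio and is given by

$$I_x(a, b) = \frac{\displaystyle\int_{0}^{x} u^{a-1}(1 - u)^{b-1}\,du}{B(a, b)},$$

where $B(a, b) = \displaystyle\int_{0}^{1} u^{a-1}(1 - u)^{b-1}\,du$. In terms of these functions, the marginal distributions of $X_{1n}$ and $X_{2n}$ are

$$F_{1,n}(x) = \begin{cases} 0, & x < 0 \\ \dfrac{1}{n} + \left(1 - \dfrac{1}{n}\right)I_x(a, b), & 0 \le x < 1 \\ 1, & 1 \le x \end{cases}$$

and

$$F_{2,n}(y) = \begin{cases} 0, & y < 0 \\ \left(1 - \dfrac{1}{n}\right)I_y(a, b), & 0 \le y < 1 \\ 1, & 1 \le y \end{cases}$$

where $0 < a, b < \infty$. The joint distribution of $(X_{1n}, X_{2n})$ is $F_n(x, y) = F_{1,n}(x)F_{2,n}(y)$, $n \ge 1$. The random vectors $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$, where $F(x, y)$ is

$$F(x, y) = \begin{cases} 0, & x < 0 \text{ or } y < 0 \\ I_x(a, b)I_y(a, b), & 0 \le x, y \le 1 \\ I_x(a, b), & 0 \le x \le 1, \; y > 1 \\ I_y(a, b), & 1 < x, \; 0 \le y < 1 \\ 1, & 1 \le x, \; 1 \le y. \end{cases}$$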

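Example 1.20. Convergence in probability.

Let $\mathbf{X}_n = (X_{1n}, X_{2n})$, where $X_{in}$ $(i = 1, 2)$ are independent and have a distribution

$$F_n(x) = \begin{cases} 0, & x < 0 \\ nx, & 0 \le x < \dfrac{1}{n} \\ 1, & \dfrac{1}{n} \le x. \end{cases}$$

Fix an $\epsilon > 0$ and let $N(\epsilon) = \left[\dfrac{\sqrt{2}}{\epsilon}\right]$; then for every $n > N(\epsilon)$,

$$P\{(X_{1n}^2 + X_{2n}^2)^{1/2} > \epsilon\} = 0,$$

since both coordinates are bounded by $1/n$. Thus, $\mathbf{X}_n \xrightarrow{p} \mathbf{0}$.

Example 1.21. Convergence in mean square.
Let $\{X_n\}$ be a sequence of random variables such that

$$E\{X_n\} = 1 + \frac{a}{n}, \quad 0 < a < \infty \quad \text{and}$$

$$V\{X_n\} = \frac{b}{n}, \quad 0 < b < \infty.$$

Then, $X_n \xrightarrow{2} 1$, as $n \to \infty$. Indeed, $E\{(X_n - 1)^2\} = \dfrac{a^2}{n^2} + \dfrac{b}{n} \to 0$, as $n \to \infty$.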

Example 1.22. Central Limit Theorem.
A. Let $\{X_n\}$, $n \ge 1$, be a sequence of i.i.d. random variables, $P\{X_n = 1\} = P\{X_n = -1\} = \dfrac{1}{2}$. Thus, $E\{X_n\} = 0$ and $V\{X_n\} = 1$, $n \ge 1$. Thus $\sqrt{n}\,\bar X_n = \dfrac{S_n}{\sqrt{n}} \xrightarrow{d} N(0, 1)$. It is interesting to note that for these random variables, when $S_n = \sum_{i=1}^{n} X_i$,

$$\frac{1}{\sqrt{n}}S_n \xrightarrow{d} N(0, 1), \quad \text{while} \quad \frac{S_n}{n} \xrightarrow{a.s.} 0.$$

B. Let $\{X_n\}$ be i.i.d., having a rectangular p.d.f.

$$f(x) = I_{(0,1)}(x).$$

In this case, $E\{X_1\} = \dfrac{1}{2}$ and $V\{X_1\} = \dfrac{1}{12}$. Thus,

$$\sqrt{12n}\left(\bar X_n - \frac{1}{2}\right) \xrightarrow{d} N(0, 1).$$

Notice that if $n = 12$ and $S_{12} = \sum_{i=1}^{12} X_i$, then $S_{12} - 6$ might have a distribution close to that of $N(0, 1)$. Early simulation programs were based on this.
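The idea of approximating a standard normal variate by the sum of twelve uniform variates minus six can be tried out directly. The following sketch is a small simulation under assumptions made here: a fixed seed, 20,000 replications, and Python's built-in generator as the source of uniform numbers. It compares the sample mean, the sample variance and one empirical probability with the corresponding $N(0, 1)$ values.

```python
import random
import statistics

random.seed(12345)   # fixed seed so that the run is reproducible

def approx_normal():
    """Sum of 12 independent U(0,1) draws, centred at 6; variance 12 * (1/12) = 1."""
    return sum(random.random() for _ in range(12)) - 6.0

draws = [approx_normal() for _ in range(20000)]
print(statistics.mean(draws), statistics.pvariance(draws))
# Empirical P{Z <= 1}, to be compared with Phi(1), which is about 0.8413.
print(sum(1 for z in draws if z <= 1.0) / len(draws))
```

One limitation of this construction is visible from the code: the generated values can never fall outside $[-6, 6]$, so the extreme tails of the normal distribution are not reproduced.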


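Example 1.23. Application of Lyapunov's Theorem.

Let $\{X_n\}$ be a sequence of independent random variables, with distribution functions

$$F_n(x) = \begin{cases} 0, & x < 0 \\ 1 - \exp\{-x/n\}, & x \ge 0 \end{cases}$$

$n \ge 1$. Thus, $E\{X_n\} = n$, $V\{X_n\} = n^2$, and $E\{X_n^3\} = 6n^3$. Thus, $B_n^2 = \sum_{k=1}^{n} k^2 = \dfrac{n(n + 1)(2n + 1)}{6}$, $n \ge 1$. In addition,

$$\sum_{k=1}^{n} E\{X_k^3\} = 6\sum_{k=1}^{n} k^3 = O(n^4).$$

Thus, since $B_n^3$ is of order $n^{9/2}$,

$$\frac{1}{B_n^3}\sum_{k=1}^{n} E\{X_k^3\} \to 0 \quad \text{as } n \to \infty.$$

It follows from Lyapunov's Theorem that

$$\sqrt{6}\,\frac{\displaystyle\sum_{k=1}^{n}(X_k - k)}{\sqrt{n(n + 1)(2n + 1)}} \xrightarrow{d} N(0, 1).$$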


Example 1.24. Variance stabilizing transformation. 
Let $\{X_n\}$ be i.i.d. binary random variables, such that $P\{X_n = 1\} = p$, and $P\{X_n = 0\} = 1 - p$. It is easy to verify that $\mu = E\{X_1\} = p$ and $V\{X_1\} = p(1 - p)$. Hence, by the CLT, $\sqrt{n}\, \frac{\bar{X}_n - p}{\sqrt{p(1 - p)}} \xrightarrow{d} N(0, 1)$, as $n \to \infty$. Consider the transformation

$$g(\bar{X}_n) = 2 \sin^{-1}\sqrt{\bar{X}_n}.$$

The derivative of $g(x)$ is

$$g^{(1)}(x) = \frac{2}{\sqrt{1 - x}} \cdot \frac{1}{2\sqrt{x}} = \frac{1}{\sqrt{x(1 - x)}}.$$

Hence $V\{X\}(g^{(1)}(p))^2 = 1$.
It follows that

$$\sqrt{n}\left(2 \sin^{-1}\sqrt{\bar{X}_n} - 2 \sin^{-1}\sqrt{p}\right) \xrightarrow{d} N(0, 1).$$

The second derivative of $g(x)$ is

$$g^{(2)}(x) = -\frac{1}{2} \cdot \frac{1 - 2x}{(x(1 - x))^{3/2}}.$$

Hence, by the delta method,

$$E\{g(\bar{X}_n)\} \cong 2 \sin^{-1}(\sqrt{p}) - \frac{1 - 2p}{4n\,(p(1 - p))^{1/2}}.$$

This approximation is very ineffective if $p$ is close to zero or close to 1. If $p$ is close to $\frac{1}{2}$ the second term on the right-hand side is close to zero. ■
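
The variance-stabilizing property and the delta-method mean can be checked against exact binomial computations. The Python sketch below is illustrative only: the function names and the values $n = 400$, $p = 0.3$ are assumptions chosen here. It computes the exact mean and variance of $2\sin^{-1}\sqrt{\bar{X}_n}$ by summing over the binomial distribution of $n\bar{X}_n$.

```python
import math


def binomial_pmf(k, n, p):
    """P{S_n = k} for S_n ~ B(n, p), computed on the log scale for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def arcsine_moments(n, p):
    """Exact mean and variance of g(X_bar) = 2 * asin(sqrt(X_bar))."""
    values = [2.0 * math.asin(math.sqrt(k / n)) for k in range(n + 1)]
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(v * w for v, w in zip(values, probs))
    var = sum((v - mean) ** 2 * w for v, w in zip(values, probs))
    return mean, var


n, p = 400, 0.3  # illustrative values
mean, var = arcsine_moments(n, p)
# delta-method approximation of the mean, as in the example
delta_mean = (2.0 * math.asin(math.sqrt(p))
              - (1 - 2 * p) / (4 * n * math.sqrt(p * (1 - p))))
print(n * var, mean, delta_mean)
```

For these values $n V\{g(\bar{X}_n)\}$ is close to 1 and the exact mean is close to the delta-method value; repeating the computation with $p$ near 0 or 1 shows the approximation deteriorating.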


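Example 1.25. A. Let $X_1, X_2, \ldots$ be i.i.d. random variables having a finite variance $0 < \sigma^2 < \infty$. Since $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$, we say that $\bar{X}_n - \mu = O_p\left(\frac{1}{\sqrt{n}}\right)$ as $n \to \infty$. Thus, if $c_n \nearrow \infty$ but $c_n = o(\sqrt{n})$, then $c_n(\bar{X}_n - \mu) \xrightarrow{p} 0$. Hence $\bar{X}_n - \mu = o_p(c_n^{-1})$, as $n \to \infty$.

B. Let $X_1, X_2, \ldots, X_n$ be i.i.d. having a common exponential distribution with p.d.f.

$$f(x; \mu) = \begin{cases}
0, & \text{if } x < 0 \\
\mu e^{-\mu x}, & \text{if } x \ge 0,
\end{cases}$$

$0 < \mu < \infty$. Let $Y_n = \min[X_i, i = 1, \ldots, n]$ be the first order statistic in a random sample of size $n$ (see Section 2.10). The p.d.f. of $Y_n$ is

$$f_n(y; \mu) = \begin{cases}
0, & \text{if } y < 0 \\
n\mu e^{-n\mu y}, & \text{if } y \ge 0.
\end{cases}$$

Thus $nY_n \sim X_1$ for all $n$. Accordingly, $Y_n = O_p\left(\frac{1}{n}\right)$ as $n \to \infty$. It is easy to see that $\sqrt{n}\, Y_n \xrightarrow{p} 0$. Indeed, for any given $\epsilon > 0$,

$$P\{\sqrt{n}\, Y_n > \epsilon\} = e^{-\sqrt{n}\mu\epsilon} \to 0 \quad \text{as } n \to \infty.$$

Thus, $Y_n = o_p\left(\frac{1}{\sqrt{n}}\right)$ as $n \to \infty$.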
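
The two rates in part B can be compared numerically. The following Python sketch is an illustration under assumed values ($\mu = 2$, $\epsilon = 0.1$; the function name `survival_of_minimum` is hypothetical). It evaluates $P\{nY_n > \epsilon\}$, which does not depend on $n$, next to $P\{\sqrt{n}\,Y_n > \epsilon\}$, which tends to zero.

```python
import math


def survival_of_minimum(t, n, mu):
    """P{Y_n > t} for the minimum of n i.i.d. exponentials with p.d.f. mu*exp(-mu*x)."""
    single = math.exp(-mu * t)  # P{X_1 > t}
    return single ** n          # independence: all n observations exceed t


mu, eps = 2.0, 0.1  # illustrative values
for n in (1, 10, 100, 1000, 10000):
    scaled_n = survival_of_minimum(eps / n, n, mu)                  # P{n Y_n > eps}
    scaled_root_n = survival_of_minimum(eps / math.sqrt(n), n, mu)  # P{sqrt(n) Y_n > eps}
    print(n, scaled_n, scaled_root_n)
```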
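Section 1.1

1.1.1 Show that $A \cup B = B \cup A$ and $AB = BA$.

1.1.2 Prove that $A \cup B = A \cup B\bar{A}$, $(A \cup B) - AB = A\bar{B} \cup \bar{A}B$.

1.1.3 Show that if $A \subset B$ then $A \cup B = B$ and $A \cap B = A$.

1.1.4 Prove DeMorgan's laws, i.e., $\overline{A \cup B} = \bar{A} \cap \bar{B}$ or $\overline{A \cap B} = \bar{A} \cup \bar{B}$.

1.1.5 Show that for every $n \ge 2$, $\overline{\left(\bigcup_{i=1}^n A_i\right)} = \bigcap_{i=1}^n \bar{A}_i$.

1.1.6 Show that if $A_1 \subset \cdots \subset A_N$ then $\sup_{1 \le n \le N} A_n = A_N$ and $\inf_{1 \le n \le N} A_n = A_1$.

1.1.7 Find $\lim_{n \to \infty} \left[0, 1 - \frac{1}{n}\right)$.

1.1.8 Find $\lim_{n \to \infty} \left(0, \frac{1}{n}\right)$.

1.1.9 Show that if $D = \{A_1, \ldots, A_k\}$ is a partition of $S$ then, for every $B$, $B = \bigcup_{i=1}^k A_i B$.

1.1.10 Prove that $\underline{\lim}_{n \to \infty} A_n \subset \overline{\lim}_{n \to \infty} A_n$.

1.1.11 Prove that $\bigcup_{n=1}^\infty A_n = \lim_{n \to \infty} \bigcup_{j=1}^n A_j$ and $\bigcap_{n=1}^\infty A_n = \lim_{n \to \infty} \bigcap_{j=1}^n A_j$.

1.1.12 Show that if $\{A_n\}$ is a sequence of pairwise disjoint sets, then $\lim_{n \to \infty} \bigcup_{j=n}^\infty A_j = \emptyset$.

1.1.13 Prove that $\overline{\lim}_{n \to \infty} (A_n \cup B_n) = \overline{\lim}_{n \to \infty} A_n \cup \overline{\lim}_{n \to \infty} B_n$.

1.1.14 Show that if $\{a_n\}$ is a sequence of nonnegative real numbers, then

$$\sup_{n \ge 1}\, [0, a_n) = \left[0, \sup_{n \ge 1} a_n\right).$$

1.1.15 Let $A \Delta B = A\bar{B} \cup \bar{A}B$ (symmetric difference). Let $\{A_n\}$ be a sequence of disjoint events; define $B_1 = A_1$, $B_{n+1} = B_n \Delta A_{n+1}$, $n \ge 1$. Prove that $\lim_{n \to \infty} B_n = \bigcup_{n=1}^\infty A_n$.

1.1.16 Verify
(i) $A \Delta B = \bar{A} \Delta \bar{B}$.
(ii) $C = A \Delta B$ if and only if $A = B \Delta C$.
(iii) $\left(\bigcup_{n=1}^\infty A_n\right) \Delta \left(\bigcup_{n=1}^\infty B_n\right) \subset \bigcup_{n=1}^\infty (A_n \Delta B_n)$.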


1.1.17 Prove that lim,o0 An = lim A 


N—>Oo “Nn 


Section 1.2

1.2.1 Let $\mathcal{A}$ be an algebra over $S$. Show that if $A_1, A_2 \in \mathcal{A}$ then $A_1 A_2 \in \mathcal{A}$.

1.2.2 Let $S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ be the set of all integers. A set $A \subset S$ is called symmetric if $A = -A$. Prove that the collection $\mathcal{A}$ of all symmetric subsets of $S$ is an algebra.

1.2.3 Let $S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$. Let $\mathcal{A}_1$ be the algebra of symmetric subsets of $S$, and let $\mathcal{A}_2$ be the algebra generated by sets $A_n = \{-2, -1, i_1, \ldots, i_n\}$, $n \ge 1$, where $i_j > 0$, $j = 1, \ldots, n$.

(i) Show that $\mathcal{A}_3 = \mathcal{A}_1 \cap \mathcal{A}_2$ is an algebra.
(ii) Show that $\mathcal{A}_4 = \mathcal{A}_1 \cup \mathcal{A}_2$ is not an algebra.

1.2.4 Show that if $\mathcal{A}$ is a $\sigma$-field, $A_n \subset A_{n+1}$, for all $n \ge 1$, then $\lim_{n \to \infty} A_n \in \mathcal{A}$.


Section 1.3

1.3.1 Let $F(x) = P\{(-\infty, x]\}$. Verify
(a) $P\{(a, b]\} = F(b) - F(a)$.
(b) $P\{(a, b)\} = F(b-) - F(a)$.
(c) $P\{[a, b)\} = F(b-) - F(a-)$.

1.3.2 Prove that $P\{A \cup B\} = P\{A\} + P\{B\bar{A}\}$.

1.3.3 A point $(X, Y)$ is chosen in the unit square. Thus, $S = \{(x, y) : 0 \le x, y \le 1\}$. Let $\mathcal{B}$ be the Borel $\sigma$-field on $S$. For a Borel set $B$, we define

$$P\{B\} = \iint_B dx\, dy.$$

Compute the probabilities of

$$B = \{(x, y) : x > \tfrac{1}{2}\},$$
$$C = \{(x, y) : x^2 + y^2 \le 1\},$$
$$D = \{(x, y) : x + y \le 1\},$$

$P\{D \cap B\}$, $P\{D \cap C\}$, $P\{C \cap B\}$.
1.3.4 Let $S = \{x : 0 \le x < \infty\}$ and $\mathcal{B}$ the Borel $\sigma$-field on $S$, generated by the sets $[0, x)$, $0 < x < \infty$. The probability function on $\mathcal{B}$ is $P\{B\} = \lambda \int_B e^{-\lambda x}\, dx$, for some $0 < \lambda < \infty$. Compute the probabilities

(i) $P\{X < 1/\lambda\}$.
(ii) $P\left\{\frac{1}{\lambda} < X < \frac{2}{\lambda}\right\}$.
(iii) Let $B_n = \left[0, \left(1 + \frac{1}{n}\right)/\lambda\right)$. Compute $\lim_{n \to \infty} P\{B_n\}$ and show that it is equal to $P\left\{\lim_{n \to \infty} B_n\right\}$.

1.3.5 Consider an experiment in which independent trials are conducted sequentially. Let $R_i$ be the result of the $i$th trial. $P\{R_i = 1\} = p$, $P\{R_i = 0\} = 1 - p$. The trials stop when $(R_1, R_2, \ldots, R_N)$ contains exactly two 1s. Notice that in this case, the number of trials $N$ is random. Describe the sample space. Let $w_n$ be a point of $S$, which contains exactly $n$ trials. $w_n = \{(i_1, \ldots, i_{n-1}, 1)\}$, $n \ge 2$, where $\sum_{j=1}^{n-1} i_j = 1$. Let $E_n = \left\{(i_1, \ldots, i_{n-1}, 1) : \sum_{j=1}^{n-1} i_j = 1\right\}$.

(i) Show that $D = \{E_2, E_3, \ldots\}$ is a countable partition of $S$.
(ii) Show that $P\{E_n\} = (n - 1)p^2 q^{n-2}$, where $0 < p < 1$, $q = 1 - p$, and prove that $\sum_{n=2}^\infty P\{E_n\} = 1$.
(iii) What is the probability that the experiment will require at least 5 trials?

1.3.6 In a parking lot there are 12 parking spaces. What is the probability that when you arrive, assuming cars fill the spaces at random, there will be four adjacent spaces vacant, while all other spaces filled?


Section 1.4

1.4.1 Show that if $A$ and $B$ are independent, then $\bar{A}$ and $\bar{B}$, $A$ and $\bar{B}$, $\bar{A}$ and $B$ are independent.

1.4.2 Show that if three events are mutually independent, then if we replace any event with its complement, the new collection is still mutually independent.

1.4.3 Two digits are chosen from the set $P = \{0, 1, \ldots, 9\}$, without replacement. The order of choice is immaterial. The probability function assigns every possible set of two the same probability. Let $A_i$ ($i = 0, \ldots, 9$) be the event that the chosen set contains the digit $i$. Show that for any $i \ne j$, $A_i$ and $A_j$ are not independent.

1.4.4 Let $A_1, \ldots, A_n$ be mutually independent events. Show that

$$P\left(\bigcup_{i=1}^n A_i\right) = 1 - \prod_{i=1}^n P\{\bar{A}_i\}.$$

1.4.5 If an event $A$ is independent of itself, then $P\{A\} = 0$ or $P\{A\} = 1$.

1.4.6 Consider the random walk model of Example 1.2.
(i) What is the probability that after $n$ steps the particle will be on a positive integer?
(ii) Compute the probability that after $n = 7$ steps the particle will be at $x = 1$.
(iii) Let $p$ be the probability that in each trial the particle goes one step to the right. Let $A_n$ be the event that the particle returns to the origin after $n$ steps. Compute $P\{A_n\}$ and show, by using the Borel-Cantelli Lemma, that if $p \ne \frac{1}{2}$ then $P\{A_n, \text{i.o.}\} = 0$.


1.4.7 Prove that 


oL()=* 
2) Cm#)=() 
0 SH(T)CAE) ee) MrT) 


1.4.8 What is the probability that the birthdays of n = 12 randomly chosen people 
will fall in 12 different calendar months? 


1.4.9 A stick is broken at random into three pieces. What is the probability that 
these pieces can form a triangle? 


1.4.10 There are n = 10 particles and m = 5 cells. Particles are assigned to the 
cells at random. 


(i) What is the probability that each cell contains at least one particle? 


(ii) What is the probability that all 10 particles are assigned to the first 3 
cells? 


Section 1.5 


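1.5.1 Let $F$ be a discrete distribution concentrated on the jump points $-\infty < \xi_1 < \xi_2 < \cdots < \infty$. Let $p_i = dF(\xi_i)$, $i = 1, 2, \ldots$. Define the function

$$U(x) = \begin{cases}
1, & \text{if } x \ge 0 \\
0, & \text{if } x < 0.
\end{cases}$$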
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1.5.2 


1.5.3 


1.5.4 




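(i) Show that, for all $-\infty < x < \infty$,

$$F(x) = \sum_{i=1}^\infty p_i U(x - \xi_i) = \sum_{i=1}^\infty p_i I(\xi_i \le x).$$

(ii) For $h > 0$, define

$$D_h U(x) = \frac{1}{h}[U(x + h) - U(x)] = \frac{1}{h}[I(x \ge -h) - I(x \ge 0)].$$

Show that

$$\int_{-\infty}^{\infty} \sum_{i=1}^\infty p_i D_h U(x - \xi_i)\, dx = 1 \quad \text{for all } h > 0.$$

(iii) Show that for any continuous function $g(x)$, such that $\sum_{i=1}^\infty p_i |g(\xi_i)| < \infty$,

$$\lim_{h \to 0} \int_{-\infty}^{\infty} \sum_{i=1}^\infty p_i\, g(x) D_h U(x - \xi_i)\, dx = \sum_{i=1}^\infty p_i\, g(\xi_i).$$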


Let X be a random variable having a discrete distribution, with jump points 
i 


2 

€; =i, and p; = dF(é;) = Bi i=0,1,2,.... Let Y = X?. Determine 
i! 

the p.d.f. of Y. 


Let X be a discrete random variable assuming the values {1, 2, ..., 2} with 
probabilities 
2i po 
; = ——, i=-l.,...,n. 
Be Feel 


(i) Find E{X}. 
(ii) Let g(X) = X?; find the p.d.f. of g(X). 


Consider a discrete random variable X, with jump points on {1, 2, ...} and 
p.d.f. 


Cc 
Fa) = 5: [in be ee 
where $c$ is a normalizing constant.
(i) Does $E\{X\}$ exist?
(ii) Does $E\{X/\log X\}$ exist?

1.5.5 Let $X$ be a discrete random variable whose distribution has jump points at $\{x_1, x_2, \ldots, x_k\}$, $1 \le k \le \infty$. Assume also that $E\{|X|\} < \infty$. Show that for any linear transformation $Y = \alpha + \beta X$, $\beta \ne 0$, $-\infty < \alpha < \infty$, $E\{Y\} = \alpha + \beta E\{X\}$. (The result is trivially true for $\beta = 0$.)

1.5.6 Consider two discrete random variables $(X, Y)$ having a joint p.d.f.

$$f_{XY}(j, n) = \frac{e^{-\lambda}}{j!\,(n - j)!} \left(\frac{p}{1 - p}\right)^j (\lambda(1 - p))^n, \quad j = 0, 1, \ldots, n, \quad n = 0, 1, 2, \ldots.$$

(i) Find the marginal p.d.f. of $X$.
(ii) Find the marginal p.d.f. of $Y$.
(iii) Find the conditional p.d.f. $f_{X|Y}(j \mid n)$, $n = 0, 1, \ldots$.
(iv) Find the conditional p.d.f. $f_{Y|X}(n \mid j)$, $j = 0, 1, \ldots$.
(v) Find $E\{Y \mid X = j\}$, $j = 0, 1, \ldots$.
(vi) Show that $E\{Y\} = E\{E\{Y \mid X\}\}$.

1.5.7 Let $X$ be a discrete random variable, $X \in \{0, 1, 2, \ldots\}$ with p.d.f.

$$f_X(n) = e^{-n} - e^{-(n+1)}, \quad n = 0, 1, \ldots.$$

Consider the partition $D = \{A_1, A_2, A_3\}$, where

$$A_1 = \{w : X(w) < 2\},$$
$$A_2 = \{w : 2 \le X(w) < 4\},$$
$$A_3 = \{w : 4 \le X(w)\}.$$

(i) Find the conditional p.d.f. $f_{X|D}(x \mid A_i)$, $i = 1, 2, 3$.
(ii) Find the conditional expectations $E\{X \mid A_i\}$, $i = 1, 2, 3$.
(iii) Specify the random variable $E\{X \mid D\}$.
1.5.8 For a given $\lambda$, $0 < \lambda < \infty$, define the function $P(j; \lambda) = e^{-\lambda} \sum_{l=0}^{j} \frac{\lambda^l}{l!}$.
(i) Show that, for a fixed nonnegative integer $j$, $F_j(x)$ is a distribution function, where

$$F_j(x) = \begin{cases}
0, & \text{if } x < 0 \\
1 - P(j - 1; x), & \text{if } x \ge 0
\end{cases}$$

and where $P(j; 0) = I\{j \ge 0\}$.
(ii) Show that $F_j(x)$ is absolutely continuous and find its p.d.f.
(iii) Find $E\{X\}$ according to $F_j(x)$.

1.5.9 Let $X$ have an absolutely continuous distribution function with p.d.f.

$$f(x) = \begin{cases}
3x^2, & \text{if } 0 \le x \le 1 \\
0, & \text{otherwise.}
\end{cases}$$

Find $E\{e^{-X}\}$.


Section 1.6 


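1.6.1 Consider the absolutely continuous distribution

$$F(x) = \begin{cases}
0, & \text{if } x < 0 \\
x, & \text{if } 0 \le x \le 1 \\
1, & \text{if } 1 < x
\end{cases}$$

of a random variable $X$. By considering the sequences of simple functions

$$X_n(w) = \sum_{i=1}^n \frac{i - 1}{n}\, I\left\{\frac{i - 1}{n} \le X(w) < \frac{i}{n}\right\}, \quad n \ge 1,$$

and

$$X_n^2(w) = \sum_{i=1}^n \left(\frac{i - 1}{n}\right)^2 I\left\{\frac{i - 1}{n} \le X(w) < \frac{i}{n}\right\}, \quad n \ge 1,$$

show that

$$\lim_{n \to \infty} E\{X_n\} = \int_0^1 x\, dx = \frac{1}{2}$$

and

$$\lim_{n \to \infty} E\{X_n^2\} = \int_0^1 x^2\, dx = \frac{1}{3}.$$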


1.6.2 Let X be a random variable having an absolutely continuous distribution F’, 
such that F(0) = 0 and F(1) = 1. Let f be the corresponding p.d_f. 
(i) Show that the Lebesgue integral 


1 De 4s ; . 
i-1 i i-1 
Pia) Solin at ) a )I: 


(ii) If the p.d.f. f is continuous on (0, 1), then 


1 1 
/ xP{dx}= / xf (x)dx, 
0 0 


which is the Riemann integral. 


1.6.3 Let $X$, $Y$ be independent identically distributed random variables and let $E\{X\}$ exist. Show that

$$E\{X \mid X + Y\} = E\{Y \mid X + Y\} = \frac{X + Y}{2} \quad \text{a.s.}$$

1.6.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables and let $E\{X_1\}$ exist. Let $S_n = \sum_{j=1}^n X_j$. Then, $E\{X_1 \mid S_n\} = \frac{S_n}{n}$, a.s.
1.6.5 Let 
0, ifx <0 
ifx =0 
Fx(x) = ‘ 
reece if0<x <1 
1, if by, 


Find E{X} and E{X?}. 
1.6.6 Let $X_1, \ldots, X_n$ be Bernoulli random variables with $P\{X_i = 1\} = p$. If $n = 100$, how large should $p$ be so that $P\{S_n < 100\} < 0.1$, when $S_n = \sum_{i=1}^n X_i$?

1.6.7 Prove that if $E\{|X|\} < \infty$, then, for every $A \in \mathcal{F}$,

$$E\{|X| I_A(X)\} \le E\{|X|\}.$$

1.6.8 Prove that if $E\{|X|\} < \infty$ and $E\{|Y|\} < \infty$, then $E\{X + Y\} = E\{X\} + E\{Y\}$.

1.6.9 Let $\{X_n\}$ be a sequence of i.i.d. random variables with common c.d.f.

$$F(x) = \begin{cases}
0, & \text{if } x < 0 \\
1 - e^{-x}, & \text{if } x \ge 0.
\end{cases}$$

Let $S_n = \sum_{i=1}^n X_i$.

(i) Use the Borel-Cantelli Lemma to show that $\lim_{n \to \infty} S_n = \infty$ a.s.
(ii) What is $\lim_{n \to \infty} E\left\{\frac{S_n}{1 + S_n}\right\}$?

1.6.10 Consider the distribution function $F$ of Example 1.11, with $\alpha = .9$, $\lambda = .1$, and $\mu = 1$.
(i) Determine the lower quartile, the median, and the upper quartile of $F_{ac}(x)$.
(ii) Tabulate the values of $F_d(x)$ for $x = 0, 1, 2, \ldots$ and determine the lower quartile, median, and upper quartile of $F_d(x)$.
(iii) Determine the values of the median and the interquartile range IQR of $F(x)$.
(iv) Determine $P\{0 < X < 3\}$.

1.6.11 Consider the Cauchy distribution with p.d.f.

$$f(x; \mu, \sigma) = \frac{1}{\pi\sigma} \cdot \frac{1}{1 + (x - \mu)^2/\sigma^2}, \quad -\infty < x < \infty,$$

with $\mu = 10$ and $\sigma = 2$.
(i) Write the formula of the c.d.f. $F(x)$.
(ii) Determine the values of the median and the interquartile range of $F(x)$.

1.6.12 Let $X$ be a random variable having the p.d.f. $f(x) = e^{-x}$, $x \ge 0$. Determine the p.d.f. and the median of
(i) $Y = \log X$,
(ii) $Y = \exp\{-X\}$.
1.6.13 Let $X$ be a random variable having a p.d.f. $f(x) = \frac{1}{\pi}$, $-\frac{\pi}{2} \le x \le \frac{\pi}{2}$. Determine the p.d.f. and the median of
(i) $Y = \sin X$,
(ii) $Y = \cos X$,
(iii) $Y = \tan X$.

1.6.14 Prove that if $E\{|X|\} < \infty$ then

$$E\{X\} = -\int_{-\infty}^{0} F(x)\, dx + \int_{0}^{\infty} (1 - F(x))\, dx.$$


1.6.15 Apply the result of the previous problem to derive the expected value of a 
random variable X having an exponential distribution, i.e., 


0, ifx <0 
Oe haere : 


ee”, ifx>0. 
1.6.16 Prove that if $F(x)$ is symmetric around $\eta$, i.e.,

$$F(\eta - x) = 1 - F(\eta + x-), \quad \text{for all } 0 \le x < \infty,$$

then $E\{X\} = \eta$, provided $E\{|X|\} < \infty$.


Section 1.7 


1.7.1 Let $(X, Y)$ be random variables having a joint p.d.f.

$$f_{XY}(x, y) = \begin{cases}
1, & \text{if } -1 < x < 1,\ 0 < y < 1 - |x| \\
0, & \text{otherwise.}
\end{cases}$$

(i) Find the marginal p.d.f. of $Y$.
(ii) Find the conditional p.d.f. of $X$ given $\{Y = y\}$, $0 < y < 1$.

1.7.2 Consider random variables $\{X, Y\}$. $X$ is a discrete random variable with jump points $\{0, 1, 2, \ldots\}$. The marginal p.d.f. of $X$ is $f_X(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, $x = 0, 1, \ldots$, $0 < \lambda < \infty$. The conditional distribution of $Y$ given $\{X = x\}$, $x \ge 1$, is

$$F_{Y|X}(y \mid x) = \begin{cases}
0, & y < 0 \\
y/x, & 0 \le y \le x \\
1, & x < y.
\end{cases}$$

When $\{X = 0\}$,

$$F_{Y|X}(y \mid 0) = \begin{cases}
0, & y < 0 \\
1, & y \ge 0.
\end{cases}$$

(i) Find $E\{Y\}$.
(ii) Show that the c.d.f. of $Y$ has discontinuity at $y = 0$, and $F_Y(0) - F_Y(0-) = e^{-\lambda}$.
(iii) For each $0 < y < \infty$, $F_Y'(y) = f_Y(y)$, where $\int_0^\infty f_Y(y)\, dy = 1 - e^{-\lambda}$. Show that, for $y > 0$,

$$f_Y(y) = \sum_{n=1}^\infty I\{n - 1 < y < n\}\, e^{-\lambda} \sum_{x=n}^\infty \frac{1}{x} \cdot \frac{\lambda^x}{x!}$$

and prove that $\int_0^\infty f_Y(y)\, dy = 1 - e^{-\lambda}$.
(iv) Derive the conditional p.d.f. of $X$ given $\{Y = y\}$, $0 < y < \infty$, and find $E\{X \mid Y = y\}$.


1.7.3 Show that if $X$, $Y$ are independent random variables, $E\{|X|\} < \infty$ and $E\{|Y|\} < \infty$, then $E\{XY\} = E\{X\}E\{Y\}$. More generally, if $g$, $h$ are integrable and $X$, $Y$ are independent, then

$$E\{g(X)h(Y)\} = E\{g(X)\}E\{h(Y)\}.$$

1.7.4 Show that if $X$, $Y$ are independent, absolutely continuous, with p.d.f. $f_X$ and $f_Y$, respectively, then the p.d.f. of $T = X + Y$ is

$$f_T(t) = \int_{-\infty}^{\infty} f_X(x) f_Y(t - x)\, dx.$$

[$f_T$ is the convolution of $f_X$ and $f_Y$.]

Section 1.8

1.8.1 Prove that if $E\{|X|^r\}$ exists, $r \ge 1$, then $\lim_{a \to \infty} a^r P\{|X| \ge a\} = 0$.

1.8.2 Let $X_1$, $X_2$ be i.i.d. random variables with $E\{X_1^2\} < \infty$. Find the correlation between $X_1$ and $T = X_1 + X_2$.

1.8.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables; find the correlation between $X_1$ and the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.
1.8.4 Let $X$ have an absolutely continuous distribution with p.d.f.

$$f_X(x) = \begin{cases}
0, & \text{if } x < 0 \\
\frac{\lambda^m}{(m - 1)!}\, x^{m-1} e^{-\lambda x}, & \text{if } x \ge 0
\end{cases}$$

where $0 < \lambda < \infty$ and $m$ is an integer, $m \ge 2$.

(i) Derive the m.g.f. of $X$. What is its domain of convergence?
(ii) Show, by differentiating the m.g.f. $M(t)$, that $E\{X^r\} = \frac{m(m + 1) \cdots (m + r - 1)}{\lambda^r}$, $r \ge 1$.
(iii) Obtain the first four central moments of $X$.
(iv) Find the coefficients of skewness $\beta_1$ and kurtosis $\beta_2$.

1.8.5 Let $X$ have an absolutely continuous distribution with p.d.f.

$$f_X(x) = \begin{cases}
\frac{1}{b - a}, & \text{if } a \le x \le b \\
0, & \text{otherwise.}
\end{cases}$$

(i) What is the m.g.f. of $X$?
(ii) Obtain $E\{X\}$ and $V\{X\}$ by differentiating the m.g.f.

1.8.6 Random variables $X_1$, $X_2$, $X_3$ have the covariance matrix

$$\Sigma = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix}.$$

Find the variance of $Y = 5X_1 - 2X_2 + 3X_3$.

1.8.7 Random variables $X_1, \ldots, X_n$ have the covariance matrix

$$\Sigma = I + J,$$

where $J$ is an $n \times n$ matrix of 1s. Find the variance of $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.

1.8.8 Let $X$ have a p.d.f.

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad -\infty < x < \infty.$$

Find the characteristic function $\phi$ of $X$.
1.8.9 Let $X_1, \ldots, X_n$ be i.i.d., having a common characteristic function $\phi$. Find the characteristic function of $\bar{X}_n = \frac{1}{n}\sum_{j=1}^n X_j$.

1.8.10 If $\phi$ is a characteristic function of an absolutely continuous distribution, its p.d.f. is

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \phi(t)\, dt.$$

Show that the p.d.f. corresponding to

$$\phi(t) = \begin{cases}
1 - |t|, & |t| \le 1 \\
0, & |t| > 1
\end{cases}$$

is

$$f(x) = \frac{1 - \cos x}{\pi x^2}, \quad |x| < \infty.$$

1.8.11 Find the m.g.f. of a random variable whose p.d.f. is

$$f_X(x) = \begin{cases}
\frac{a - |x|}{a^2}, & \text{if } |x| \le a \\
0, & \text{if } |x| > a,
\end{cases}$$

$0 < a < \infty$.

1.8.12 Prove that if $\phi$ is a characteristic function, then $|\phi(t)|^2$ is a characteristic function.

1.8.13 Prove that if $\phi$ is a characteristic function, then
(i) $\lim_{|t| \to \infty} \phi(t) = 0$ if $X$ has an absolutely continuous distribution.
(ii) $\limsup_{|t| \to \infty} |\phi(t)| = 1$ if $X$ is discrete.

1.8.14 Let $X$ be a discrete random variable with p.d.f.

$$f(x) = \begin{cases}
e^{-\lambda} \frac{\lambda^x}{x!}, & x = 0, 1, \ldots \\
0, & \text{otherwise.}
\end{cases}$$

Find the p.g.f. of $X$.
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Section 1.9 


1.9.1 


1.9.2 


1.9.3 


1.9.4 


1.9.5 


1.9.6 


Let F,, n> 1, be the c.d.f. of a discrete uniform distribution on 


1 2 
{recesses Show that F,,(x) ey F(x), as n — oo, where 
nn 
0, ifx <0 
F(x)= 4x, if0<x<1 
1, ifl <x. 


Let B(j;n, p) denote the c.d.f. of the binomial distribution with p.d.f. 
b(j3n, p) = (“pa = p= PHO cc Wy 
where 0 < p < 1. Consider the sequence of binomial distributions 
1 
F(x) = B | [x];n, n MO<x<njt+i{x>n}, n21. 
n 


What is the weak limit of F,,(x)? 


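Let $X_1, X_2, \ldots, X_n, \ldots$ be i.i.d. random variables such that $V\{X_1\} = \sigma^2 < \infty$, and $\mu = E\{X_1\}$. Use Chebychev's inequality to prove that $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu$ as $n \to \infty$.

Let $X_1, X_2, \ldots$ be a sequence of binary random variables, such that $P\{X_n = 1\} = \frac{1}{n}$ and $P\{X_n = 0\} = 1 - \frac{1}{n}$, $n \ge 1$.

(i) Show that $X_n \xrightarrow{r} 0$ as $n \to \infty$, for any $r \ge 1$.
(ii) Show from the definition that $X_n \xrightarrow{p} 0$ as $n \to \infty$.
(iii) Show that if $\{X_n\}$ are independent, then $P\{X_n = 1, \text{i.o.}\} = 1$. Thus, $X_n \not\to 0$ a.s.

Let $\epsilon_1, \epsilon_2, \ldots$ be independent r.v., such that $E\{\epsilon_n\} = \mu$ and $V\{\epsilon_n\} = \sigma^2$ for all $n \ge 1$. Let $X_1 = \epsilon_1$ and for $n \ge 2$, let $X_n = \beta X_{n-1} + \epsilon_n$, where $-1 < \beta < 1$. Show that $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{2} \frac{\mu}{1 - \beta}$, as $n \to \infty$.

Prove that convergence in the $r$th mean, for some $r > 0$, implies convergence in the $s$th mean, for all $0 < s < r$.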


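1.9.7 Let $X_1, X_2, \ldots, X_n, \ldots$ be i.i.d. random variables having a common rectangular distribution $R(0, \theta)$, $0 < \theta < \infty$. Let $X_{(n)} = \max\{X_1, \ldots, X_n\}$. Let $\epsilon > 0$. Show that $\sum_{n=1}^\infty P_\theta\{X_{(n)} < \theta - \epsilon\} < \infty$. Hence, by the Borel-Cantelli Lemma, $X_{(n)} \xrightarrow{a.s.} \theta$, as $n \to \infty$. The $R(0, \theta)$ distribution is

$$F_\theta(x) = \begin{cases}
0, & \text{if } x < 0 \\
\frac{x}{\theta}, & \text{if } 0 \le x \le \theta \\
1, & \text{if } \theta < x
\end{cases}$$

where $0 < \theta < \infty$.

1.9.8 Show that if $X_n \xrightarrow{p} X$ and $X_n \xrightarrow{p} Y$, then $P\{w : X(w) \ne Y(w)\} = 0$.

1.9.9 Let $X_n \xrightarrow{p} X$, $Y_n \xrightarrow{p} Y$, $P\{w : X(w) \ne Y(w)\} = 0$. Then, for every $\epsilon > 0$,

$$P\{|X_n - Y_n| \ge \epsilon\} \to 0, \quad \text{as } n \to \infty.$$

1.9.10 Show that if $X_n \xrightarrow{d} C$ as $n \to \infty$, where $C$ is a constant, then $X_n \xrightarrow{p} C$.

1.9.11 Let $\{X_n\}$ be such that, for any $p > 0$, $\sum_{n=1}^\infty E\{|X_n|^p\} < \infty$. Show that $X_n \xrightarrow{a.s.} 0$ as $n \to \infty$.

1.9.12 Let $\{X_n\}$ be a sequence of i.i.d. random variables. Show that $E\{|X_1|\} < \infty$ if and only if $\sum_{n=1}^\infty P\{|X_1| > \epsilon \cdot n\} < \infty$. Show that $E|X_1| < \infty$ if and only if $\frac{X_n}{n} \xrightarrow{a.s.} 0$.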


Section 1.10

1.10.1 Show that if $X_n$ has a p.d.f. $f_n$ and $X$ has a p.d.f. $g(x)$ and if $\int |f_n(x) - g(x)|\, dx \to 0$ as $n \to \infty$, then $\sup_B |P_n\{B\} - P\{B\}| \to 0$ as $n \to \infty$, for all Borel sets $B$ (Ferguson, 1996, p. 12).

1.10.2 Show that if $a'X_n \xrightarrow{d} a'X$ as $n \to \infty$, for all vectors $a$, then $X_n \xrightarrow{d} X$ (Ferguson, 1996, p. 18).
1.10.3 Let $\{X_n\}$ be a sequence of i.i.d. random variables. Let $Z_n = \sqrt{n}(\bar{X}_n - \mu)$, $n \ge 1$, where $\mu = E\{X_1\}$ and $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Let $V\{X_1\} < \infty$. Show that $\{Z_n\}$ is tight.

1.10.4 Let $B(n, p)$ designate a discrete random variable, having a binomial distribution with parameter $(n, p)$. Show that $\left\{B\left(n, \frac{1}{n}\right)\right\}$ is tight.

1.10.5 Let $P(\lambda)$ designate a discrete random variable, which assumes on $\{0, 1, 2, \ldots\}$ the p.d.f. $f(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, $x = 0, 1, \ldots$, $0 < \lambda < \infty$. Using the continuity theorem prove that $B(n, p_n) \xrightarrow{d} P(\lambda)$ if $\lim_{n \to \infty} n p_n = \lambda$.

1.10.6 Let $X_n \sim B\left(n, \frac{1}{2n}\right)$, $n \ge 1$. Compute $\lim_{n \to \infty} E\{e^{-X_n}\}$.
2n n—>oo 


Section 1.11

1.11.1 (Khinchin WLLN). Use the continuity theorem to prove that if $X_1, X_2, \ldots, X_n, \ldots$ are i.i.d. random variables, then $\bar{X}_n \xrightarrow{p} \mu$, where $\mu = E\{X_1\}$.

1.11.2 (Markov WLLN). Prove that if $X_1, X_2, \ldots, X_n, \ldots$ are independent random variables and if $\mu_k = E\{X_k\}$ exists, for all $k \ge 1$, and $E|X_k - \mu_k|^{1+\delta} < \infty$ for some $\delta > 0$, all $k \ge 1$, then $\frac{1}{n^{1+\delta}}\sum_{k=1}^n E|X_k - \mu_k|^{1+\delta} \to 0$ as $n \to \infty$ implies that $\frac{1}{n}\sum_{k=1}^n (X_k - \mu_k) \xrightarrow{p} 0$ as $n \to \infty$.


1.11.3 Let {X,,} be a sequence of random vectors. Prove that if X, a pm then 


X, —> pw, where X 2% ani ee 
n > n n J 1s- 


j=l 


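1.11.4 Let $\{X_n\}$ be a sequence of i.i.d. random variables having a common p.d.f.

$$f(x) = \begin{cases}
0, & \text{if } x < 0 \\
\frac{\lambda^m}{(m - 1)!}\, x^{m-1} e^{-\lambda x}, & \text{if } x \ge 0,
\end{cases}$$

where $0 < \lambda < \infty$, $m = 1, 2, \ldots$. Use Cantelli's Theorem (Theorem 1.11.1) to prove that $\bar{X}_n \xrightarrow{a.s.} \frac{m}{\lambda}$, as $n \to \infty$.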
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1.11.5 Let {X,,} be a sequence of independent random variables where 
Xn, ~ R(—n,n)/n 


and R(—n, n) is arandom variable having a uniform distribution on (—n, n), 
1.€., 


1 
n = lena : 
Fn) mn ( n(x) 
Show that X,, —> 0, as — oo. [Prove that condition (1.11.6) holds]. 


1.11.6 Let {X,,} be a sequence of i.i.d. random variables, such that |X,| < C a.s., 
for alln > 1. Show that X, = jh asn — oo, where w = E{X}}. 


1.11.7 Let {X,,} be a sequence of independent random variables, such that 


1 1 
Pik, = 1) =5(1- =) 


and 


P{X, = tn} = n> 1. 


1 
Qn’ 


Nile 


ee as. 
Prove that -S°X; —s. 0,asn > oo. 
n 


i=l 
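Section 1.12

1.12.1 Let $X \sim P(\lambda)$, i.e.,

$$f(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, \ldots.$$

Apply the continuity theorem to show that

$$\frac{X - \lambda}{\sqrt{\lambda}} \xrightarrow{d} N(0, 1), \quad \text{as } \lambda \to \infty.$$

1.12.2 Let $\{X_n\}$ be a sequence of i.i.d. discrete random variables, and $X_1 \sim P(\lambda)$. Show that

$$\frac{S_n - n\lambda}{\sqrt{n\lambda}} \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty.$$

What is the relation between problems 1 and 2?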
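1.12.3 Let $\{X_n\}$ be i.i.d., binary random variables, $P\{X_n = 1\} = P\{X_n = 0\} = \frac{1}{2}$, $n \ge 1$. Show that

$$\frac{\sum_{i=1}^n i X_i - \frac{n(n + 1)}{4}}{B_n} \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty,$$

where $B_n^2 = \frac{n(n + 1)(2n + 1)}{24}$.

1.12.4 Consider a sequence $\{X_n\}$ of independent discrete random variables, $P\{X_n = n\} = P\{X_n = -n\} = \frac{1}{2}$, $n \ge 1$. Show that this sequence satisfies the CLT, in the sense that

$$\frac{\sqrt{6}\, S_n}{\sqrt{n(n + 1)(2n + 1)}} \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty.$$

1.12.5 Let $\{X_n\}$ be a sequence of i.i.d. random variables, having a common absolutely continuous distribution with p.d.f.

$$f(x) = \begin{cases}
\frac{1}{2|x| \log^2 |x|}, & \text{if } |x| \le \frac{1}{e} \\
0, & \text{if } |x| > \frac{1}{e}.
\end{cases}$$

Show that this sequence satisfies the CLT, i.e.,

$$\sqrt{n}\, \frac{\bar{X}_n}{\sigma} \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty,$$

where $\sigma^2 = V\{X\}$.

1.12.6 (i) Show that

$$\frac{G(1, n) - n}{\sqrt{n}} \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty,$$

where $G(1, n)$ is an absolutely continuous random variable with a p.d.f.

$$g_n(x) = \begin{cases}
0, & \text{if } x < 0 \\
\frac{1}{(n - 1)!}\, x^{n-1} e^{-x}, & x \ge 0.
\end{cases}$$

(ii) Show that, for large $n$,

$$g_n(n) = \frac{1}{(n - 1)!}\, n^{n-1} e^{-n} \approx \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sqrt{n}},$$

or equivalently

$$n! \approx \sqrt{2\pi}\, n^{n + \frac{1}{2}} e^{-n} \quad \text{as } n \to \infty.$$

This is the famous Stirling approximation.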


Section 1.13 
1.13.1 Let X, ~ R(—n,n),n > 1. Is the sequence {X,,} uniformly integrable? 


Xp —_ . : ‘ 
1.13.2 Let Z, = 7a ~ N(O, 1),n = 1. Show that {Z,,} is uniformly integrable. 
n 


1.13.3 Let $\{X_1, X_2, \ldots, X_n, \ldots\}$ and $\{Y_1, Y_2, \ldots, Y_n, \ldots\}$ be two independent sequences of i.i.d. random variables. Assume that $0 < V\{X_1\} = \sigma_X^2 < \infty$, $0 < V\{Y_1\} = \sigma_Y^2 < \infty$. Let $f(x, y)$ be a continuous function on $R^2$, having continuous partial derivatives. Find the limiting distribution of $\sqrt{n}(f(\bar{X}_n, \bar{Y}_n) - f(\xi, \eta))$, where $\xi = E\{X_1\}$, $\eta = E\{Y_1\}$. In particular, find the limiting distribution of $R_n = \bar{X}_n / \bar{Y}_n$, when $\eta > 0$.


1.13.4 We say that X ~ E(w), 0 < w < ©, if its p.d-f. is 


0, ifx <0 
A a ae ifx > 0. 
Let X1, Xo,..., Xn,... be a sequence of i.i.d. random variables, X; ~ 


- ti 
E(u),0 < uw < w. Let X, = -)°X;. 
ree 


(a) Compute V {e*"} exactly. 
(b) Approximate V {e*"} by the delta method. 


1.13.5 Let $\{X_n\}$ be i.i.d. Bernoulli random variables, i.e., $X_1 \sim B(1, p)$, $0 < p < 1$. Let $\hat{p}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and

$$W_n = \log\left(\frac{\hat{p}_n}{1 - \hat{p}_n}\right).$$

Use the delta method to find an approximation, for large values of $n$, of
(i) $E\{W_n\}$
(ii) $V\{W_n\}$.

Find the asymptotic distribution of $\sqrt{n}\left(W_n - \log\left(\frac{p}{1 - p}\right)\right)$.

1.13.6 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a common continuous distribution function $F(x)$. Let $F_n(x)$ be the empirical distribution function. Fix a value $x_0$ such that $0 < F(x_0) < 1$.

(i) Show that $n F_n(x_0) \sim B(n, F(x_0))$.
(ii) What is the asymptotic distribution of $F_n(x_0)$ as $n \to \infty$?

1.13.7 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a standard Cauchy distribution. What is the asymptotic distribution of the sample median $F_n^{-1}\left(\frac{1}{2}\right)$?
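1.1.5 For $n = 2$, $\overline{A_1 \cup A_2} = \bar{A}_1 \cap \bar{A}_2$. By induction on $n$, assume that $\overline{\bigcup_{i=1}^k A_i} = \bigcap_{i=1}^k \bar{A}_i$ for all $k = 2, \ldots, n$. For $k = n + 1$,

$$\overline{\bigcup_{i=1}^{n+1} A_i} = \overline{\left(\bigcup_{i=1}^{n} A_i\right) \cup A_{n+1}} = \left(\bigcap_{i=1}^{n} \bar{A}_i\right) \cap \bar{A}_{n+1} = \bigcap_{i=1}^{n+1} \bar{A}_i.$$

1.1.10 We have to prove that $\left(\underline{\lim}_{n \to \infty} A_n\right) \subset \left(\overline{\lim}_{n \to \infty} A_n\right)$. For an elementary event $w \in S$, let

$$I_{A_n}(w) = \begin{cases}
1, & \text{if } w \in A_n \\
0, & \text{if } w \notin A_n.
\end{cases}$$

Thus, if $w \in \underline{\lim}_{n \to \infty} A_n = \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k$, there exists an integer $K(w)$ such that

$$\prod_{n \ge K(w)} I_{A_n}(w) = 1.$$

Accordingly, for all $n \ge 1$, $w \in \bigcup_{k=n}^\infty A_k$. Hence $w \in \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k = \overline{\lim}_{n \to \infty} A_n$.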
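1.1.15 Let $\{A_n\}$ be a sequence of disjoint events. For all $n > 1$, we define

$$B_n = B_{n-1} \Delta A_n = B_{n-1}\bar{A}_n \cup \bar{B}_{n-1}A_n$$

and

$$B_1 = A_1,$$
$$B_2 = A_1\bar{A}_2 \cup \bar{A}_1 A_2,$$
$$\begin{aligned}
B_3 &= (A_1\bar{A}_2 \cup \bar{A}_1 A_2)\bar{A}_3 \cup \overline{(A_1\bar{A}_2 \cup \bar{A}_1 A_2)}\, A_3 \\
&= (\overline{A_1\bar{A}_2} \cap \overline{\bar{A}_1 A_2}) A_3 \cup A_1\bar{A}_2\bar{A}_3 \cup \bar{A}_1 A_2\bar{A}_3 \\
&= (\bar{A}_1 \cup A_2)(A_1 \cup \bar{A}_2) A_3 \cup A_1\bar{A}_2\bar{A}_3 \cup \bar{A}_1 A_2\bar{A}_3 \\
&= A_1 A_2 A_3 \cup A_1\bar{A}_2\bar{A}_3 \cup \bar{A}_1 A_2\bar{A}_3 \cup \bar{A}_1\bar{A}_2 A_3.
\end{aligned}$$

By induction on $n$ we prove that, for all $n \ge 2$,

$$B_n = \left(\bigcap_{j=1}^n A_j\right) \cup \left(\bigcup_{i=1}^n A_i \bigcap_{j \ne i} \bar{A}_j\right) = \bigcup_{i=1}^n A_i.$$

Hence $B_n \subset B_{n+1}$ for all $n \ge 1$ and $\lim_{n \to \infty} B_n = \bigcup_{n=1}^\infty A_n$.

1.2.2 The sample space $S = Z$, the set of all integers. $A$ is a symmetric set in $S$, if $A = -A$. Let $\mathcal{A}$ = {collection of all symmetric sets}. $\emptyset \in \mathcal{A}$. If $A \in \mathcal{A}$ then $\bar{A} \in \mathcal{A}$. Indeed $-\bar{A} = -S - (-A) = S - A = \bar{A}$. Thus, $\bar{A} \in \mathcal{A}$. Moreover, if $A, B \in \mathcal{A}$ then $A \cup B \in \mathcal{A}$. Thus, $\mathcal{A}$ is an algebra.

1.2.3 $S = Z$. Let $\mathcal{A}_1$ = {generated by symmetric sets}. $\mathcal{A}_2$ = {generated by $(-2, -1, i_1, \ldots, i_n)$, $n \ge 1$, $i_j \in N$ $\forall j = 1, \ldots, n$}. Notice that if $A = (-2, -1, i_1, \ldots, i_n)$ then $\bar{A} = \{(\cdots, -4, -3, N - (i_1, \ldots, i_n))\} \in \mathcal{A}_2$, and $S = A \cup \bar{A} \in \mathcal{A}_2$. $\mathcal{A}_2$ is an algebra. $\mathcal{A}_3 = \mathcal{A}_1 \cap \mathcal{A}_2$. If $B \in \mathcal{A}_3$ it must be symmetric and also $B \in \mathcal{A}_2$. Thus, $B = (-2, -1, 1, 2)$ or $B = (\cdots, -4, -3, 3, 4, \ldots)$. Thus, $B$ and $\bar{B}$ are in $\mathcal{A}_3$, so $S = (B \cup \bar{B}) \in \mathcal{A}_3$ and so is $\emptyset$. Thus, $\mathcal{A}_3$ is an algebra.

Let $\mathcal{A}_4 = \mathcal{A}_1 \cup \mathcal{A}_2$. Let $A = \{-2, -1, 3, 7\}$ and $B = \{-3, 3\}$. Then $A \cup B = \{-3, -2, -1, 3, 7\}$. But $A \cup B$ belongs neither to $\mathcal{A}_1$ nor to $\mathcal{A}_2$. Thus $A \cup B \notin \mathcal{A}_4$. $\mathcal{A}_4$ is not an algebra.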
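1.3.5 The sample space is

$$S = \left\{(i_1, \ldots, i_{n-1}, 1) : \sum_{j=1}^{n-1} i_j = 1,\ n \ge 2\right\}.$$

(i) Let $E_n = \left\{(i_1, \ldots, i_{n-1}, 1) : \sum_{j=1}^{n-1} i_j = 1\right\}$, $n \ge 2$. For $j \ne k$, $E_j \cap E_k = \emptyset$. Also $\bigcup_{n=2}^\infty E_n = S$. Thus, $D = \{E_2, E_3, \ldots\}$ is a countable partition of $S$.

(ii) All elementary events $w_n = (i_1, \ldots, i_{n-1}, 1) \in E_n$ are equally probable and $P\{w_n\} = p^2 q^{n-2}$. There are $\binom{n-1}{1} = n - 1$ such elementary events in $E_n$. Thus, $P\{E_n\} = (n - 1)p^2 q^{n-2}$. Moreover,

$$\sum_{n=2}^\infty P\{E_n\} = p^2 \sum_{n=2}^\infty (n - 1) q^{n-2} = p^2 \sum_{l=1}^\infty l q^{l-1} = 1.$$

Indeed,

$$\sum_{l=1}^\infty l q^{l-1} = \frac{d}{dq} \sum_{l=1}^\infty q^l = \frac{d}{dq}\left(\frac{q}{1 - q}\right) = \frac{1}{(1 - q)^2} = \frac{1}{p^2}.$$

(iii) The probability that the experiment requires at least 5 trials is the probability that in the first 4 trials there is at most 1 success, which is $1 - p^2(1 + 2q + 3q^2)$.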
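
The answer to part (iii) can be cross-checked with exact rational arithmetic. This Python sketch is illustrative (function names and the value $p = 1/3$ are assumptions made here): it compares the closed form with the complement of $P\{E_2\} + P\{E_3\} + P\{E_4\}$ and with the binomial probability of at most one success in four trials.

```python
from fractions import Fraction


def prob_stop_at(n, p):
    """P{E_n} = (n - 1) p^2 q^(n - 2): the second success occurs at trial n."""
    q = 1 - p
    return (n - 1) * p ** 2 * q ** (n - 2)


def prob_at_least_five(p):
    """Closed form from part (iii): 1 - p^2 (1 + 2q + 3q^2)."""
    q = 1 - p
    return 1 - p ** 2 * (1 + 2 * q + 3 * q ** 2)


p = Fraction(1, 3)  # illustrative value
q = 1 - p
by_complement = 1 - sum(prob_stop_at(n, p) for n in (2, 3, 4))
at_most_one_success = q ** 4 + 4 * p * q ** 3
print(by_complement, at_most_one_success, prob_at_least_five(p))
```

For $p = 1/3$ all three expressions equal $16/27$.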


1.4.6 Let $X_n$ denote the position of the particle after $n$ steps.

(i) If $n = 2k$, the particle after $n$ steps could be, on the positive side, only on even integers $2, 4, 6, \ldots, 2k$. If $n = 2k + 1$, the particle could be after $n$ steps on the positive side only on an odd integer $1, 3, 5, \ldots, 2k + 1$.
Let $p$ be the probability of a step to the right ($0 < p < 1$) and $q = 1 - p$ of a step to the left. If $n = 2k + 1$, the particle is at $2j + 1$ when $k + j + 1$ of the steps are to the right, so

$$P\{X_n = 2j + 1\} = \binom{2k + 1}{k + j + 1} p^{k+j+1} q^{k-j}, \quad j = 0, \ldots, k.$$

Thus, if $n = 2k + 1$,

$$P\{X_n > 0\} = \sum_{j=0}^k \binom{2k + 1}{k + j + 1} p^{k+j+1} q^{k-j}.$$

In this solution, we assumed that all steps are independent (see Section 1.7). If $n = 2k$ the formula can be obtained in a similar manner.

(ii) $P\{X_7 = 1\} = \binom{7}{4} p^4 q^3$. If $p = \frac{1}{2}$ then $P\{X_7 = 1\} = \frac{35}{128} \approx 0.2734$.

(iii) The probability of returning to the origin after $n$ steps is

$$P\{X_n = 0\} = \begin{cases}
0, & \text{if } n = 2k + 1 \\
\binom{2k}{k} p^k q^k, & \text{if } n = 2k.
\end{cases}$$

Let $A_n = \{X_n = 0\}$. Then, $\sum_{k=0}^\infty P\{A_{2k+1}\} = 0$ and when $p = \frac{1}{2}$,

$$\sum_{k=0}^\infty P\{A_{2k}\} = \sum_{k=0}^\infty \binom{2k}{k} \frac{1}{4^k} = \infty.$$

Thus, by the Borel-Cantelli Lemma, if $p = \frac{1}{2}$, $P\{A_n \text{ i.o.}\} = 1$. On the other hand, if $p \ne \frac{1}{2}$,

$$\sum_{k=1}^\infty \binom{2k}{k} (pq)^k = \frac{4pq}{\sqrt{1 - 4pq}\left(1 + \sqrt{1 - 4pq}\right)} < \infty.$$

Thus, if $p \ne \frac{1}{2}$, $P\{A_n \text{ i.o.}\} = 0$.

1.5.1 $F(x)$ is a discrete distribution with jump points at $-\infty < \xi_1 < \xi_2 < \cdots < \infty$. $p_i = dF(\xi_i)$, $i = 1, 2, \ldots$. $U(x) = I(x \ge 0)$.

(i) $U(x - \xi_i) = I(x \ge \xi_i)$, so

$$F(x) = \sum_{\xi_i \le x} p_i = \sum_{i=1}^\infty p_i U(x - \xi_i).$$

(ii) For $h > 0$,

$$D_h U(x) = \frac{1}{h}[U(x + h) - U(x)].$$

$U(x + h) = 1$ if $x \ge -h$. Thus,

$$D_h U(x) = \frac{1}{h}\, I(-h \le x < 0).$$


= mn Ck) ae h) 
En lim “Ye igi) 


§ d 
Here, G(&)) = / e(xidv; 5 GE) = #6). 


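1.5.6 The joint p.d.f. of two discrete random variables is

$$f_{XY}(j, n) = \frac{e^{-\lambda}}{j!\,(n - j)!} \left(\frac{p}{1 - p}\right)^j (\lambda(1 - p))^n, \quad j = 0, \ldots, n, \quad n = 0, 1, \ldots.$$

(i) The marginal distribution of $X$ is

$$\begin{aligned}
f_X(j) &= \sum_{n=j}^\infty f_{XY}(j, n) \\
&= \frac{p^j e^{-\lambda}}{(1 - p)^j\, j!} \sum_{n=j}^\infty \frac{(\lambda(1 - p))^n}{(n - j)!} \\
&= e^{-\lambda} \frac{(\lambda p)^j}{j!} \sum_{n=0}^\infty \frac{(\lambda(1 - p))^n}{n!} \\
&= e^{-\lambda p} \frac{(\lambda p)^j}{j!}, \quad j = 0, 1, 2, \ldots.
\end{aligned}$$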
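(ii) The marginal p.d.f. of $Y$ is

$$\begin{aligned}
f_Y(n) &= \sum_{j=0}^n f_{XY}(j, n) \\
&= e^{-\lambda} \frac{\lambda^n}{n!} \sum_{j=0}^n \binom{n}{j} p^j (1 - p)^{n-j} \\
&= e^{-\lambda} \frac{\lambda^n}{n!}, \quad n = 0, 1, \ldots.
\end{aligned}$$

(iii) $f_{X|Y}(j \mid n) = \frac{f_{XY}(j, n)}{f_Y(n)} = \binom{n}{j} p^j (1 - p)^{n-j}$, $j = 0, \ldots, n$.

(iv) $f_{Y|X}(n \mid j) = \frac{f_{XY}(j, n)}{f_X(j)} = e^{-\lambda(1 - p)} \frac{(\lambda(1 - p))^{n-j}}{(n - j)!}$, $n \ge j$.

(v) $E\{Y \mid X = j\} = j + \lambda(1 - p)$.

(vi) $E\{Y\} = E\{E\{Y \mid X\}\} = \lambda(1 - p) + E\{X\} = \lambda(1 - p) + \lambda p = \lambda$.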
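
The two marginal distributions can be verified numerically by summing the joint p.d.f. The Python sketch below is illustrative: the function names, the parameter values $\lambda = 3$, $p = 0.4$, and the truncation point of the infinite sum are assumptions made here.

```python
import math


def joint_pmf(j, n, lam, p):
    """f_XY(j, n) of Problem 1.5.6, for 0 <= j <= n."""
    return (math.exp(-lam) / (math.factorial(j) * math.factorial(n - j))
            * (p / (1 - p)) ** j * (lam * (1 - p)) ** n)


def poisson_pmf(k, mean):
    """Poisson p.d.f. with the given mean, evaluated at k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def marginal_x(j, lam, p, n_max=80):
    """Sum of the joint p.d.f. over n = j, ..., n_max - 1 (truncated series)."""
    return sum(joint_pmf(j, n, lam, p) for n in range(j, n_max))


def marginal_y(n, lam, p):
    """Sum of the joint p.d.f. over j = 0, ..., n."""
    return sum(joint_pmf(j, n, lam, p) for j in range(n + 1))


lam, p = 3.0, 0.4  # illustrative values
for k in range(5):
    print(k, marginal_x(k, lam, p), poisson_pmf(k, lam * p),
          marginal_y(k, lam, p), poisson_pmf(k, lam))
```

Each row shows the marginal of $X$ next to the Poisson p.d.f. with mean $\lambda p$, and the marginal of $Y$ next to the Poisson p.d.f. with mean $\lambda$.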
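1.5.8

$$F_j(x) = \begin{cases}
0, & \text{if } x < 0 \\
1 - P(j - 1; x), & \text{if } x \ge 0,
\end{cases}$$

where $j \ge 1$, and $P(j - 1; x) = e^{-x} \sum_{i=0}^{j-1} \frac{x^i}{i!}$.

(i) We have to show that, for each $j \ge 1$, $F_j(x)$ is a c.d.f.
(a) $0 \le F_j(x) \le 1$ for all $0 \le x < \infty$.
(b) $F_j(0) = 0$ and $\lim_{x \to \infty} F_j(x) = 1$.
(c) We show now that $F_j(x)$ is strictly increasing in $x$. Indeed, for all $x > 0$,

$$\frac{d}{dx} P(j - 1; x) = -e^{-x} \sum_{i=0}^{j-1} \frac{x^i}{i!} + e^{-x} \sum_{i=1}^{j-1} \frac{x^{i-1}}{(i - 1)!} = -e^{-x} \frac{x^{j-1}}{(j - 1)!} < 0, \quad \text{for all } 0 < x < \infty.$$

(ii) The density of $F_j(x)$ is

$$f_j(x) = \frac{x^{j-1}}{(j - 1)!}\, e^{-x}, \quad j \ge 1, \quad x \ge 0.$$

$F_j(x)$ is absolutely continuous.

(iii) $E_j\{X\} = \int_0^\infty \frac{x^j}{(j - 1)!}\, e^{-x}\, dx = j$.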
0 M 


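1.6.3 $X$, $Y$ are independent and identically distributed, $E|X| < \infty$.

$$E\{X \mid X + Y\} + E\{Y \mid X + Y\} = X + Y,$$

$$E\{X \mid X + Y\} = E\{Y \mid X + Y\} = \frac{X + Y}{2}.$$

1.6.9

$$F(x) = \begin{cases}
0, & x < 0 \\
1 - e^{-x}, & x \ge 0.
\end{cases}$$

(i) Let $A_n = \{X_n > 1\}$. The events $A_n$, $n \ge 1$, are independent. Also $P\{A_n\} = e^{-1}$. Hence, $\sum_{n=1}^\infty P\{A_n\} = \infty$. Thus, by the Borel-Cantelli Lemma, $P\{A_n \text{ i.o.}\} = 1$. That is, $P\left(\lim_{n \to \infty} S_n = \infty\right) = 1$.

(ii) $\frac{S_n}{1 + S_n} \ge 0$. This random variable is bounded by 1. Thus, by the Dominated Convergence Theorem, $\lim_{n \to \infty} E\left\{\frac{S_n}{1 + S_n}\right\} = E\left\{\lim_{n \to \infty} \frac{S_n}{1 + S_n}\right\} = 1$.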

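1.8.4

$$f_X(x) = \begin{cases}
0, & \text{if } x < 0 \\
\frac{\lambda^m}{(m - 1)!}\, x^{m-1} e^{-\lambda x}, & x \ge 0;
\end{cases} \quad m \ge 2.$$

(i) The m.g.f. of $X$ is

$$M(t) = \frac{\lambda^m}{(m - 1)!} \int_0^\infty e^{-x(\lambda - t)} x^{m-1}\, dx = \frac{\lambda^m}{(\lambda - t)^m} = \left(1 - \frac{t}{\lambda}\right)^{-m}, \quad \text{for } t < \lambda.$$

The domain of convergence is $(-\infty, \lambda)$.

(ii)

$$M'(t) = \frac{m}{\lambda}\left(1 - \frac{t}{\lambda}\right)^{-(m+1)},$$
$$M''(t) = \frac{m(m + 1)}{\lambda^2}\left(1 - \frac{t}{\lambda}\right)^{-(m+2)},$$
$$\vdots$$
$$M^{(r)}(t) = \frac{m(m + 1) \cdots (m + r - 1)}{\lambda^r}\left(1 - \frac{t}{\lambda}\right)^{-(m+r)}.$$

Thus, $\mu_r = M^{(r)}(t)\big|_{t=0} = \frac{m(m + 1) \cdots (m + r - 1)}{\lambda^r}$, $r \ge 1$.

(iii)

$$\mu_1 = \frac{m}{\lambda}, \quad \mu_2 = \frac{m(m + 1)}{\lambda^2}, \quad \mu_3 = \frac{m(m + 1)(m + 2)}{\lambda^3}, \quad \mu_4 = \frac{m(m + 1)(m + 2)(m + 3)}{\lambda^4}.$$

The central moments are

$$\mu_1^* = 0,$$
$$\mu_2^* = \mu_2 - \mu_1^2 = \frac{m}{\lambda^2},$$
$$\mu_3^* = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3 = \frac{1}{\lambda^3}\left(m(m + 1)(m + 2) - 3m^2(m + 1) + 2m^3\right) = \frac{2m}{\lambda^3},$$
$$\begin{aligned}
\mu_4^* &= \mu_4 - 4\mu_3\mu_1 + 6\mu_2\mu_1^2 - 3\mu_1^4 \\
&= \frac{1}{\lambda^4}\left(m(m + 1)(m + 2)(m + 3) - 4m^2(m + 1)(m + 2) + 6m^3(m + 1) - 3m^4\right) \\
&= \frac{3m(m + 2)}{\lambda^4}.
\end{aligned}$$

(iv) $\beta_1 = \frac{2m}{m^{3/2}} = \frac{2}{\sqrt{m}}$; $\beta_2 = \frac{3m(m + 2)}{m^2} = 3 + \frac{6}{m}$.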
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1.8.11 The m.g.f. is 


$$
M_X(t) = \frac{1}{a^2}\int_{-a}^{a} e^{tx}(a - |x|)\,dx
= \frac{2(\cosh(at) - 1)}{a^2t^2}
= 1 + \frac{1}{12}a^2t^2 + o(t^2), \qquad \text{as } t \to 0.
$$


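1.9.1 $F_n(x) = \begin{cases} 0, & x < 0 \\ \dfrac{j}{n}, & \dfrac{j}{n} \le x < \dfrac{j+1}{n},\ j = 0, \ldots, n-1 \\ 1, & 1 \le x. \end{cases}$

$F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \le x < 1 \\ 1, & 1 \le x. \end{cases}$

All points $-\infty < x < \infty$ are continuity points of $F(x)$. $\lim_{n\to\infty} F_n(x) = F(x)$ for all $x < 0$ or $x > 1$. $|F_n(x) - F(x)| \le \dfrac{1}{n}$ for all $0 \le x \le 1$. Thus $F_n(x) \to F(x)$, as $n \to \infty$.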


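1.9.4 $X_n = \begin{cases} 0, & \text{w.p. } \left(1 - \dfrac{1}{n}\right) \\ 1, & \text{w.p. } \dfrac{1}{n}. \end{cases}$

(i) $E\{|X_n|^r\} = \dfrac{1}{n}$ for all $r > 0$. Thus, $X_n \xrightarrow{r} 0$ as $n \to \infty$, for all $r > 0$.

(ii) $P\{|X_n| > \epsilon\} \le \dfrac{1}{n}$ for all $n \ge 1$, any $\epsilon > 0$. Thus, $X_n \xrightarrow{p} 0$.

(iii) Let $A_n = \{w : X_n(w) = 1\}$; $P\{A_n\} = \dfrac{1}{n}$, $n \ge 1$. $\sum_{n=1}^{\infty}\dfrac{1}{n} = \infty$. Since $X_1, X_2, \ldots$ are independent, by the Borel-Cantelli Lemma, $P\{X_n = 1, \text{i.o.}\} = 1$. Thus $X_n \not\to 0$ a.s.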


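1.9.5 $\varepsilon_1, \varepsilon_2, \ldots$ independent r.v.s, such that $E\{\varepsilon_n\} = \mu$ and $V\{\varepsilon_n\} = \sigma^2$, $\forall n \ge 1$.

$$
\begin{aligned}
X_1 &= \varepsilon_1, \\
X_n &= \beta X_{n-1} + \varepsilon_n \\
&= \beta(\beta X_{n-2} + \varepsilon_{n-1}) + \varepsilon_n \\
&= \cdots = \sum_{j=1}^{n}\beta^{n-j}\varepsilon_j, \qquad \forall n \ge 1,\ |\beta| < 1.
\end{aligned}
$$

Thus, $E\{X_n\} = \mu\sum_{j=0}^{n-1}\beta^j \to \dfrac{\mu}{1-\beta}$ as $n \to \infty$.

$$
\bar X_n = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{i}\beta^{i-j}\varepsilon_j
= \frac{1}{n}\sum_{j=1}^{n}\varepsilon_j\sum_{i=j}^{n}\beta^{i-j}
= \frac{1}{n(1-\beta)}\sum_{j=1}^{n}\varepsilon_j\left(1 - \beta^{n-j+1}\right).
$$

Since $\{\varepsilon_n\}$ are independent,

$$
\begin{aligned}
V\{\bar X_n\} &= \frac{\sigma^2}{n^2(1-\beta)^2}\sum_{j=1}^{n}\left(1 - \beta^{n-j+1}\right)^2 \\
&= \frac{\sigma^2}{n(1-\beta)^2}\left(1 - \frac{2(\beta - \beta^{n+1})}{n(1-\beta)} + \frac{\beta^2 - \beta^{2(n+1)}}{n(1-\beta^2)}\right) \to 0 \quad \text{as } n \to \infty.
\end{aligned}
$$

Furthermore,

$$
E\left\{\left(\bar X_n - \frac{\mu}{1-\beta}\right)^2\right\} = V\{\bar X_n\} + \left(E\{\bar X_n\} - \frac{\mu}{1-\beta}\right)^2,
$$

$$
\left(E\{\bar X_n\} - \frac{\mu}{1-\beta}\right)^2 = \frac{\mu^2}{n^2(1-\beta)^4}\left(\beta - \beta^{n+1}\right)^2 \to 0 \quad \text{as } n \to \infty.
$$

Hence, $\bar X_n \xrightarrow{2} \dfrac{\mu}{1-\beta}$.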
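1.9.7 $X_1, X_2, \ldots$ i.i.d. distributed like $R(0, \theta)$. $X_{(n)} = \max_{1 \le i \le n}(X_i)$. Due to independence,

$$
F_n(x) = P\{X_{(n)} \le x\} = \begin{cases} 0, & x < 0 \\ \left(\dfrac{x}{\theta}\right)^n, & 0 \le x < \theta \\ 1, & \theta \le x. \end{cases}
$$

Accordingly, $P\{X_{(n)} < \theta - \epsilon\} = \left(1 - \dfrac{\epsilon}{\theta}\right)^n$, $0 < \epsilon < \theta$. Thus, $\sum_{n=1}^{\infty} P\{X_{(n)} \le \theta - \epsilon\} < \infty$, and $P\{X_{(n)} \le \theta - \epsilon, \text{i.o.}\} = 0$. Hence, $X_{(n)} \to \theta$ a.s.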


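1.10.2 We are given that $\mathbf{a}'\mathbf{X}_n \xrightarrow{d} \mathbf{a}'\mathbf{X}$ for all $\mathbf{a}$. Consider the m.g.f.s; by the continuity theorem, $M_{\mathbf{a}'\mathbf{X}_n}(t) = E\{e^{t\mathbf{a}'\mathbf{X}_n}\} \to E\{e^{t\mathbf{a}'\mathbf{X}}\}$, for all $t$ in the domain of convergence. Thus $E\{e^{\boldsymbol{\beta}'\mathbf{X}_n}\} \to E\{e^{\boldsymbol{\beta}'\mathbf{X}}\}$ for all $\boldsymbol{\beta} = t\mathbf{a}$. Thus, $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$.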


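1.11.1 (i) $M_{\bar X_n}(t) = \left(M_X\left(\dfrac{t}{n}\right)\right)^n = \left(1 + \dfrac{t\mu}{n} + o\left(\dfrac{1}{n}\right)\right)^n \to e^{t\mu}$ as $n \to \infty$.

$e^{t\mu}$ is the m.g.f. of the distribution

$$
F(x) = \begin{cases} 0, & x < \mu \\ 1, & x \ge \mu. \end{cases}
$$

Thus, by the continuity theorem, $\bar X_n \xrightarrow{d} \mu$ and, therefore, $\bar X_n \xrightarrow{p} \mu$, as $n \to \infty$.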


1.11.5 $\{X_n\}$ are independent. For $\delta > 0$,

$$X_n \sim R(-n, n)/n.$$

The expected values are $E\{X_n\} = 0$, $\forall n \ge 1$.

$$\sigma_n^2 = \frac{4n^2}{12n^2} = \frac{1}{3},$$

$$\sum_{n=1}^{\infty}\frac{\sigma_n^2}{n^2} < \infty.$$

Hence, by (1.11.6), $\bar X_n \xrightarrow{a.s.} 0$.
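1.12.1 $M_{\frac{X-\lambda}{\sqrt{\lambda}}}(t) = e^{-t\sqrt{\lambda}}\exp\left\{-\lambda\left(1 - e^{t/\sqrt{\lambda}}\right)\right\}$.

$$
1 - e^{t/\sqrt{\lambda}} = -\frac{t}{\sqrt{\lambda}} - \frac{t^2}{2\lambda} + O\left(\frac{1}{\lambda^{3/2}}\right).
$$

Thus,

$$
-t\sqrt{\lambda} - \lambda\left(1 - e^{t/\sqrt{\lambda}}\right) = \frac{t^2}{2} + O\left(\frac{1}{\sqrt{\lambda}}\right).
$$

Hence,

$$
M_{\frac{X-\lambda}{\sqrt{\lambda}}}(t) = \exp\left\{\frac{t^2}{2} + O\left(\frac{1}{\sqrt{\lambda}}\right)\right\} \to e^{t^2/2} \quad \text{as } \lambda \to \infty.
$$

$M_Z(t) = e^{t^2/2}$ is the m.g.f. of $N(0, 1)$.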
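1.12.3 $P\{X_n = 1\} = \dfrac{1}{2}$, $P\{X_n = 0\} = \dfrac{1}{2}$, $E\{X_i\} = \dfrac{1}{2}$.

Let $Y_n = nX_n$; $E\{Y_n\} = \dfrac{n}{2}$, $E|Y_n|^2 = \dfrac{n^2}{2}$. Notice that

$$\sum_{i=1}^{n} iX_i - \frac{n(n+1)}{4} = \sum_{i=1}^{n}(Y_i - \mu_i),$$

where $\mu_i = \dfrac{i}{2} = E\{Y_i\}$. $E|Y_i - \mu_i|^3 = \dfrac{i^3}{8}$. Accordingly,

$$
\frac{\sum_{i=1}^{n} E|Y_i - \mu_i|^3}{\left(\sum_{i=1}^{n} E(Y_i - \mu_i)^2\right)^{3/2}}
= \frac{\frac{1}{32}n^2(n+1)^2}{\left(\frac{1}{24}n(n+1)(2n+1)\right)^{3/2}} \to 0 \quad \text{as } n \to \infty.
$$

Thus, by Lyapunov's Theorem,

$$
\frac{\sum_{i=1}^{n} iX_i - \frac{n(n+1)}{4}}{\left(\frac{1}{24}n(n+1)(2n+1)\right)^{1/2}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty.
$$

1.13.5 $\{X_n\}$ i.i.d. $B(1, p)$, $0 < p < 1$.

(i) With $W(p) = \log\left(\dfrac{p}{1-p}\right)$,

$$E\{W_n\} \cong W(p) + \frac{p(1-p)}{2n}W''(p),$$

$$W'(p) = \frac{1-p+p}{p(1-p)} = \frac{1}{p(1-p)}, \qquad W''(p) = -\frac{1-2p}{p^2(1-p)^2}.$$

Thus,

$$E\{W_n\} \cong \log\left(\frac{p}{1-p}\right) - \frac{1-2p}{2np(1-p)}.$$

(ii) $V\{W_n\} \cong \dfrac{1}{n}\cdot\dfrac{p(1-p)}{(p(1-p))^2} = \dfrac{1}{np(1-p)}$.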


CHAPTER 2 


Statistical Distributions 


PART I: THEORY 
2.1 INTRODUCTORY REMARKS 


This chapter presents a systematic discussion of families of distribution functions, 
which are widely used in statistical modeling. We discuss univariate and multivariate 
distributions. A good part of the chapter is devoted to the distributions of sample 
statistics. 


2.2 FAMILIES OF DISCRETE DISTRIBUTIONS 


2.2.1 Binomial Distributions 


Binomial distributions correspond to random variables that count the number of suc- 
cesses among N independent trials having the same probability of success. Such trials 
are called Bernoulli trials. The probabilistic model of Bernoulli trials is applicable in 
many situations, where it is reasonable to assume independence and constant success 
probability. 

Binomial distributions have two parameters, $N$ (number of trials) and $\theta$ (success probability), where $N$ is a positive integer and $0 < \theta < 1$. The probability distribution function is denoted by $b(i; N, \theta)$ and is

$$b(i; N, \theta) = \binom{N}{i}\theta^i(1-\theta)^{N-i}, \qquad i = 0, 1, \ldots, N. \tag{2.2.1}$$

The c.d.f. is designated by $B(i; N, \theta)$, and is equal to $B(i; N, \theta) = \sum_{j=0}^{i} b(j; N, \theta)$.

The Binomial distribution formula can also be expressed in terms of the incomplete beta function by

$$\sum_{j=a}^{N} b(j; N, \theta) = I_\theta(a, N - a + 1), \qquad a = 1, \ldots, N, \tag{2.2.2}$$

where

$$I_\xi(p, q) = \frac{1}{B(p, q)}\int_0^{\xi} u^{p-1}(1-u)^{q-1}\,du, \qquad 0 \le \xi \le 1. \tag{2.2.3}$$

The parameters $p$ and $q$ are positive, i.e., $0 < p, q < \infty$; $B(p, q) = \int_0^1 u^{p-1}(1-u)^{q-1}\,du$ is the (complete) beta function. Or

$$B(i; N, \theta) = 1 - I_\theta(i + 1, N - i) = I_{1-\theta}(N - i, i + 1), \qquad i = 0, \ldots, N - 1. \tag{2.2.4}$$

The quantiles $B^{-1}(p; N, \theta)$, $0 < p < 1$, can be easily determined by finding the smallest value of $i$ at which $B(i; N, \theta) \ge p$.
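The quantile rule just stated can be sketched in a few lines of Python. This is only an illustration under our own naming (`binom_pmf`, `binom_cdf`, and `binom_quantile` are hypothetical helper names, not notation from the text); it evaluates (2.2.1) directly and accumulates the c.d.f. until it first reaches $p$.

```python
from math import comb

def binom_pmf(i, N, theta):
    """b(i; N, theta) as in (2.2.1)."""
    return comb(N, i) * theta**i * (1.0 - theta)**(N - i)

def binom_cdf(i, N, theta):
    """B(i; N, theta): sum of b(j; N, theta) over j = 0, ..., i."""
    return sum(binom_pmf(j, N, theta) for j in range(i + 1))

def binom_quantile(p, N, theta):
    """Smallest i with B(i; N, theta) >= p, for 0 < p < 1."""
    acc = 0.0
    for i in range(N + 1):
        acc += binom_pmf(i, N, theta)
        if acc >= p:
            return i
    return N  # only reached if rounding keeps acc slightly below p

q50 = binom_quantile(0.5, 10, 0.3)
q90 = binom_quantile(0.9, 10, 0.3)
```

For larger $N$ one would normally rely on a library routine or on the incomplete beta relation (2.2.4) rather than on direct summation; the loop above is meant only to make the definition concrete.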


2.2.2 Hypergeometric Distributions 


The hypergeometric distributions are applicable when we sample at random without replacement from a finite population (collection) of $N$ units, so that every possible sample of size $n$ has equal selection probability, $1/\binom{N}{n}$. If $X$ denotes the number of units in the sample having a certain attribute, and if $M$ is the number of units in the population (before sampling) having the same attribute, then the distribution of $X$ is hypergeometric with the probability density function (p.d.f.)

$$h(i; N, M, n) = \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}, \qquad i = 0, \ldots, n. \tag{2.2.5}$$

The c.d.f. of the hypergeometric distribution will be denoted by $H(i; N, M, n)$. When $n/N$ is sufficiently small (smaller than 0.1 for most practical applications), we can approximate $H(i; N, M, n)$ by $B(i; n, M/N)$. Better approximations (Johnson and Kotz, 1969, p. 148) are available, as well as bounds on the error terms.


2.2.3 Poisson Distributions


Poisson distributions are applied when the random variables under consideration 
count the number of events occurring in a specified time period, or on a spatial 
area, and the observed processes satisfy the basic conditions of time (or space) 
homogeneity, independent increments, and no memory of the past (Feller, 1966, 
p. 566). The Poisson distribution is prevalent in numerous applications of statistics to 
engineering reliability, traffic flow, queuing and inventory theories, computer design, 
ecology, etc. 

A random variable $X$ is said to have a Poisson distribution with intensity $\lambda$, $0 < \lambda < \infty$, if it assumes only the nonnegative integers according to a probability distribution function

$$p(i; \lambda) = e^{-\lambda}\frac{\lambda^i}{i!}, \qquad i = 0, 1, \ldots. \tag{2.2.6}$$

The c.d.f. of such a distribution is denoted by $P(i; \lambda)$.

The Poisson distribution can be obtained from the Binomial distribution by letting $N \to \infty$, $\theta \to 0$ so that $N\theta \to \lambda$, where $0 < \lambda < \infty$ (Feller, 1966, p. 153, or Problem 5 of Section 1.10). For this reason, the Poisson distribution can provide a good model in cases of counting events that occur very rarely (the number of cases of a rare disease per 100,000 in the population; the number of misprints per page in a book, etc.).
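The limiting relation between the Binomial and Poisson distributions can be inspected numerically. The sketch below is an illustration only, with helper names of our own choosing (`binom_pmf`, `poisson_pmf`, `max_gap`); it fixes $\lambda = 2$, sets $\theta = \lambda/N$, and records the largest pointwise difference between $b(i; N, \theta)$ and $p(i; \lambda)$ over $i = 0, \ldots, 20$.

```python
from math import comb, exp, factorial

def binom_pmf(i, N, theta):
    """b(i; N, theta) as in (2.2.1)."""
    return comb(N, i) * theta**i * (1.0 - theta)**(N - i)

def poisson_pmf(i, lam):
    """p(i; lambda) as in (2.2.6)."""
    return exp(-lam) * lam**i / factorial(i)

def max_gap(N, lam, i_max=20):
    """Largest |b(i; N, lam/N) - p(i; lam)| over i = 0, ..., i_max."""
    theta = lam / N
    return max(abs(binom_pmf(i, N, theta) - poisson_pmf(i, lam))
               for i in range(i_max + 1))

# The gap should shrink as N grows with N * theta held at lambda = 2.
gaps = [max_gap(N, lam=2.0) for N in (50, 500, 5000)]
```

The choice of $\lambda = 2$ and of the cutoff `i_max = 20` is arbitrary; any fixed $\lambda$ shows the same qualitative behavior.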

The Poisson c.d.f. can be determined from the incomplete gamma function according to the following formula

$$P(k; \lambda) = \frac{1}{\Gamma(k+1)}\int_{\lambda}^{\infty} x^k e^{-x}\,dx, \tag{2.2.7}$$

for all $k = 0, 1, \ldots$, where

$$\Gamma(p) = \int_0^{\infty} x^{p-1}e^{-x}\,dx, \qquad p > 0, \tag{2.2.8}$$

is the gamma function.


2.2.4 Geometric, Pascal, and Negative Binomial Distributions 


The geometric distribution is the distribution of the number of Bernoulli trials until the first success. This distribution has therefore many applications (the number of shots at a target until the first hit). The probability distribution function of a geometric random variable is

$$g(i; \theta) = \theta(1-\theta)^{i-1}, \qquad i = 1, 2, \ldots, \tag{2.2.9}$$

where $\theta$, $0 < \theta < 1$, is the probability of success.

If the random variable counts the number of Bernoulli trials until the $\nu$-th success, $\nu = 1, 2, \ldots$, we obtain the Pascal distribution with p.d.f.

$$g(i; \theta, \nu) = \binom{i-1}{\nu-1}\theta^{\nu}(1-\theta)^{i-\nu}, \qquad i = \nu, \nu+1, \ldots. \tag{2.2.10}$$

The geometric distributions constitute a subfamily with $\nu = 1$. Another family of distributions of this type is that of the Negative-Binomial distributions. We designate by $NB(\psi, \nu)$, $0 < \psi < 1$, $0 < \nu < \infty$, a random variable having a Negative-Binomial distribution if its p.d.f. is

$$nb(i; \psi, \nu) = \frac{\Gamma(\nu + i)}{\Gamma(i+1)\Gamma(\nu)}(1-\psi)^{\nu}\psi^{i}, \qquad i = 0, 1, \ldots. \tag{2.2.11}$$

Notice that if $X$ has the Pascal distribution with parameters $\nu$ and $\theta$, then $X - \nu$ is distributed like $NB(1 - \theta, \nu)$. The probability distribution of Negative-Binomial random variables assigns positive probabilities to all the nonnegative integers. It can therefore be applied as a model in cases of counting random variables where the Poisson assumptions are invalid. Moreover, as we show later, Negative-Binomial distributions may be obtained as averages of Poisson distributions. The family of Negative-Binomial distributions depends on two parameters and can therefore be fitted to a variety of empirical distributions better than the Poisson distributions. Examples of this nature can be found in logistics research, in studies of population growth with immigration, etc.

The c.d.f. of the $NB(\psi, \nu)$, to be designated as $NB(i; \psi, \nu)$, can be determined by the incomplete beta function according to the formula

$$NB(k; \psi, \nu) = I_{1-\psi}(\nu, k + 1), \qquad k = 0, 1, \ldots. \tag{2.2.12}$$

A proof of this useful relationship is given in Example 2.3.
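Relation (2.2.12) can be checked numerically before reading the proof. The following sketch is only an illustration under stated assumptions: the function names are ours, the incomplete beta function (2.2.3) is evaluated by a plain composite Simpson rule (adequate here because $p \ge 1$ and $q \ge 1$ keep the integrand bounded), and the parameter values $\psi = 0.3$, $\nu = 2.5$, $k = 4$ are arbitrary.

```python
from math import exp, lgamma, log

def nb_pmf(i, psi, nu):
    """nb(i; psi, nu) as in (2.2.11), computed on the log scale."""
    log_p = (lgamma(nu + i) - lgamma(i + 1) - lgamma(nu)
             + nu * log(1.0 - psi) + i * log(psi))
    return exp(log_p)

def nb_cdf(k, psi, nu):
    """NB(k; psi, nu): sum of nb(i; psi, nu) over i = 0, ..., k."""
    return sum(nb_pmf(i, psi, nu) for i in range(k + 1))

def inc_beta(xi, p, q, steps=20000):
    """I_xi(p, q) from (2.2.3) by composite Simpson; assumes p >= 1, q >= 1
    and an even number of steps."""
    h = xi / steps
    f = lambda u: u**(p - 1.0) * (1.0 - u)**(q - 1.0)
    total = f(0.0) + f(xi)
    for m in range(1, steps):
        total += (4.0 if m % 2 else 2.0) * f(m * h)
    log_beta = lgamma(p) + lgamma(q) - lgamma(p + q)
    return (h / 3.0) * total / exp(log_beta)

psi, nu, k = 0.3, 2.5, 4
lhs = nb_cdf(k, psi, nu)              # left-hand side of (2.2.12)
rhs = inc_beta(1.0 - psi, nu, k + 1)  # right-hand side of (2.2.12)
```

A production computation would use a library implementation of the regularized incomplete beta function; the quadrature here is kept deliberately simple so the block depends only on the standard library.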


2.3. SOME FAMILIES OF CONTINUOUS DISTRIBUTIONS 


2.3.1 Rectangular Distributions 


A random variable $X$ has a rectangular distribution over the interval $(\theta_1, \theta_2)$, $-\infty < \theta_1 < \theta_2 < \infty$, if its p.d.f. is

$$f_R(x; \theta_1, \theta_2) = \begin{cases} \dfrac{1}{\theta_2 - \theta_1}, & \text{if } \theta_1 \le x \le \theta_2 \\ 0, & \text{otherwise.} \end{cases} \tag{2.3.1}$$

The family of all rectangular distributions is a two-parameter family. We denote r.v.s having these distributions by $R(\theta_1, \theta_2)$; $-\infty < \theta_1 < \theta_2 < \infty$. We note that if $X$ is distributed as $R(\theta_1, \theta_2)$, then $X$ is equivalent to $\theta_1 + (\theta_2 - \theta_1)U$, where $U \sim R(0, 1)$. This can be easily verified by considering the distribution functions of $R(\theta_1, \theta_2)$ and of $R(0, 1)$, respectively. Accordingly, $\theta_1$ can be considered a location parameter and $\theta_2 - \theta_1$ a scale parameter. Let $f_U(x) = I\{0 \le x \le 1\}$ be the p.d.f. of the standard rectangular r.v. $U$. Thus, we can express the p.d.f. of $R(\theta_1, \theta_2)$ by the general presentation of p.d.f.s in the location and scale parameter models; namely

$$f_R(x; \theta_1, \theta_2) = \frac{1}{\theta_2 - \theta_1}\,f_U\left(\frac{x - \theta_1}{\theta_2 - \theta_1}\right), \qquad -\infty < x < \infty. \tag{2.3.2}$$


The standard rectangular distribution function occupies an important place in the theory of statistics. One of the reasons is that if a random variable has an arbitrary continuous distribution function $F(x)$, then the transformed random variable $Y = F(X)$ is distributed as $U$. For each $\xi$, $0 < \xi < 1$, let

$$F^{-1}(\xi) = \inf\{x : F(x) = \xi\}. \tag{2.3.3}$$

Accordingly, since $F(x)$ is nondecreasing and continuous,

$$P\{F(X) \le \xi\} = P\{X \le F^{-1}(\xi)\} = F(F^{-1}(\xi)) = \xi. \tag{2.3.4}$$

The transformation $X \to F(X)$ is called the Cumulative Probability Integral Transformation.

Notice that the $p$th quantile of $R(\theta_1, \theta_2)$ is

$$R_p(\theta_1, \theta_2) = \theta_1 + p(\theta_2 - \theta_1). \tag{2.3.5}$$


The following has application in the theory of testing hypotheses.
If $X$ has a discrete distribution $F(x)$ and if we define the function

$$H(x, \gamma) = F(x - 0) + \gamma[F(x) - F(x - 0)], \tag{2.3.6}$$

where $-\infty < x < \infty$ and $0 \le \gamma \le 1$, then $H(X, U)$ has a rectangular distribution as $R(0, 1)$, where $U$ is also distributed like $R(0, 1)$, independently of $X$. We notice that if $x$ is a jump point of $F(x)$, then $H(x, \gamma)$ assumes a value in the interval $[F(x - 0), F(x)]$. On the other hand, if $x$ is not a jump point, then $H(x, \gamma) = F(x)$ for all $\gamma$. Thus, for every $p$, $0 \le p \le 1$,

$$H(x, \gamma) \le p \quad \text{if and only if} \quad x < F^{-1}(p), \ \text{ or } \ x = F^{-1}(p) \text{ and } \gamma \le \gamma(p),$$

where

$$\gamma(p) = \frac{p - F(F^{-1}(p) - 0)}{F(F^{-1}(p)) - F(F^{-1}(p) - 0)}. \tag{2.3.7}$$

Accordingly, for every $p$, $0 \le p \le 1$,

$$
\begin{aligned}
P\{H(X, U) \le p\} &= P\{X < F^{-1}(p)\} + P\{U \le \gamma(p)\}\,P\{X = F^{-1}(p)\} \\
&= F(F^{-1}(p) - 0) + \gamma(p)[F(F^{-1}(p)) - F(F^{-1}(p) - 0)] = p.
\end{aligned}
\tag{2.3.8}
$$
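The identity (2.3.8) can be verified exactly for a specific discrete distribution. The sketch below takes $X \sim B(10, .3)$ as an arbitrary example and uses the fact that, given $X = x$, $H(x, U)$ is uniform on $[F(x - 0), F(x)]$; the function name `prob_H_le` is ours and purely illustrative.

```python
from math import comb

def binom_pmf(i, N, theta):
    """b(i; N, theta) as in (2.2.1)."""
    return comb(N, i) * theta**i * (1.0 - theta)**(N - i)

def prob_H_le(p, pmf):
    """P{H(X, U) <= p} for a discrete X with probabilities pmf[0], pmf[1], ...

    Conditional on X = x, H(x, U) is uniform on [F(x - 0), F(x)], so the
    conditional probability is (p - F(x - 0)) / P{X = x}, clipped to [0, 1].
    """
    total, left = 0.0, 0.0          # left holds F(x - 0)
    for mass in pmf:
        if mass > 0.0:
            frac = min(1.0, max(0.0, (p - left) / mass))
            total += mass * frac
        left += mass
    return total

pmf = [binom_pmf(i, 10, 0.3) for i in range(11)]
checks = [(p, prob_H_le(p, pmf)) for p in (0.05, 0.25, 0.5, 0.9)]
```

Each pair in `checks` should hold a level $p$ and a computed probability equal to $p$ up to rounding, which is the content of (2.3.8).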


2.3.2 Beta Distributions 


The family of Beta distributions is a two-parameter family of continuous distributions concentrated over the interval $[0, 1]$. We denote these distributions by $B(p, q)$; $0 < p, q < \infty$. The p.d.f. of a $B(p, q)$ distribution is

$$f(x; p, q) = \frac{1}{B(p, q)}\,x^{p-1}(1-x)^{q-1}, \qquad 0 \le x \le 1. \tag{2.3.9}$$

The $R(0, 1)$ distribution is a special case. The distribution function (c.d.f.) of $B(p, q)$ coincides over the interval $(0, 1)$ with the incomplete Beta function (2.2.3). Notice that

$$I_\xi(p, q) = 1 - I_{1-\xi}(q, p), \qquad \text{for all } 0 \le \xi \le 1. \tag{2.3.10}$$

Hence, the Beta distribution is symmetric about $x = .5$ if and only if $p = q$.


2.3.3. Gamma Distributions 


The Gamma function $\Gamma(p)$ was defined in (2.2.8). On the basis of this function we define a two-parameter family of distribution functions. We say that a random variable $X$ has a Gamma distribution with positive parameters $\lambda$ and $p$, to be denoted by $G(\lambda, p)$, if its p.d.f. is

$$f(x; \lambda, p) = \frac{\lambda^p}{\Gamma(p)}\,x^{p-1}e^{-\lambda x}, \qquad 0 \le x < \infty. \tag{2.3.11}$$

$\lambda^{-1}$ is a scale parameter, and $p$ is called a shape parameter. A special important case is that of $p = 1$. In this case, the density reduces to

$$f(x; \lambda) = \lambda e^{-\lambda x}, \qquad 0 \le x < \infty. \tag{2.3.12}$$

This distribution is called the (negative) exponential distribution. Exponentially distributed r.v.s with parameter $\lambda$ are denoted also as $E(\lambda)$.

The following relationship between Gamma distributions explains the role of the scale parameter $\lambda^{-1}$:

$$G(\lambda, p) \sim \frac{1}{\lambda}\,G(1, p), \qquad \text{for all } \lambda. \tag{2.3.13}$$

Indeed, from the definition of the gamma p.d.f. the following relationship holds for all $\xi$, $0 \le \xi < \infty$,

$$
\begin{aligned}
P\{G(\lambda, p) \le \xi\} &= \frac{\lambda^p}{\Gamma(p)}\int_0^{\xi} x^{p-1}e^{-\lambda x}\,dx \\
&= \frac{1}{\Gamma(p)}\int_0^{\lambda\xi} x^{p-1}e^{-x}\,dx = P\left\{\frac{1}{\lambda}\,G(1, p) \le \xi\right\}.
\end{aligned}
\tag{2.3.14}
$$

In the case of $\lambda = \frac{1}{2}$ and $p = \nu/2$, $\nu = 1, 2, \ldots$, the Gamma distribution is also called chi-squared distribution with $\nu$ degrees of freedom. The chi-squared random variables are denoted by $\chi^2[\nu]$, i.e.,

$$\chi^2[\nu] \sim G\left(\frac{1}{2}, \frac{\nu}{2}\right), \qquad \nu = 1, 2, \ldots. \tag{2.3.15}$$
The reason for designating a special name for this subfamily of Gamma distributions 
will be explained later. 


2.3.4 Weibull and Extreme Value Distributions 


The family of Weibull distributions has been extensively applied to the theory of systems reliability as a model for lifetime distributions (Zacks, 1992). It is also used in the theory of survival distributions with biological applications (Gross and Clark, 1975). We say that a random variable $X$ has a Weibull distribution with parameters $(\lambda, \alpha, \xi)$; $0 < \lambda$, $0 < \alpha < \infty$; $-\infty < \xi < \infty$, if $(X - \xi)^{\alpha} \sim G(\lambda, 1)$. Accordingly, $(X - \xi)^{\alpha}$ has an exponential distribution with a scale parameter $\lambda^{-1}$; $\xi$ is a location parameter, i.e., the p.d.f. assumes positive values only for $x \ge \xi$. We will assume here, without loss of generality, that $\xi = 0$. The parameter $\alpha$ is called the shape parameter. The p.d.f. of $X$, for $\xi = 0$, is

$$f_W(x; \lambda, \alpha) = \lambda\alpha x^{\alpha-1}\exp\{-\lambda x^{\alpha}\}, \qquad 0 \le x < \infty, \tag{2.3.16}$$

and its c.d.f. is

$$F_W(x; \lambda, \alpha) = \begin{cases} 1 - \exp\{-\lambda x^{\alpha}\}, & x \ge 0 \\ 0, & x < 0. \end{cases} \tag{2.3.17}$$

The extreme value distribution (of Type I) is obtained from the Weibull distribution if we consider the distribution of $Y = -\log X$, where $X^{\alpha} \sim G(\lambda, 1)$. Accordingly, the c.d.f. of $Y$ is

$$P\{Y \le \eta\} = \exp\{-\lambda e^{-\alpha\eta}\}, \tag{2.3.18}$$

$-\infty < \eta < \infty$, and its p.d.f. is

$$f_{EV}(x; \lambda, \alpha) = \lambda\alpha\exp\{-\alpha x - \lambda e^{-\alpha x}\}, \tag{2.3.19}$$

$-\infty < x < \infty$.

Extreme value distributions have been applied in problems of testing strength of materials, maximal water flow in rivers, biomedical problems, etc. (Gumbel, 1958).


2.3.5 Normal Distributions 


The normal distribution occupies a central role in statistical theory. Many of the 
statistical tests and estimation procedures are based on statistics that have distributions 
approximately normal in a large sample. 

The family of normal distributions, to be designated by $N(\xi, \sigma^2)$, depends on two parameters: a location parameter $\xi$, $-\infty < \xi < \infty$, and a scale parameter $\sigma$, $0 < \sigma < \infty$. The p.d.f. of a normal distribution is

$$f(x; \xi, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{x - \xi}{\sigma}\right)^2\right\}, \tag{2.3.20}$$

$-\infty < x < \infty$.

The normal distribution with $\xi = 0$ and $\sigma = 1$ is called the standard normal distribution. The standard normal p.d.f. is denoted by $\phi(x)$. Notice that $N(\xi, \sigma^2) \sim \xi + \sigma N(0, 1)$. Indeed, since $\sigma > 0$,

$$
\begin{aligned}
P\{N(\xi, \sigma^2) \le x\} &= \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{x}\exp\left\{-\frac{1}{2}\left(\frac{y - \xi}{\sigma}\right)^2\right\}dy \\
&= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{x - \xi}{\sigma}}\exp\left\{-\frac{1}{2}z^2\right\}dz = P\{\xi + \sigma N(0, 1) \le x\}.
\end{aligned}
\tag{2.3.21}
$$

According to (2.3.21), the c.d.f. of $N(\xi, \sigma^2)$ can be computed on the basis of the standard c.d.f. The standard c.d.f. is denoted by $\Phi(x)$. It is also called the standard normal integral. Efficient numerical techniques are available for the computation of $\Phi(x)$. The function and its derivatives are tabulated. Efficient numerical approximations and asymptotic expansions are given in Abramowitz and Stegun (1968, p. 925).

The normal p.d.f. is symmetric about the location parameter $\xi$. From this symmetry, we deduce that

$$
\begin{aligned}
\phi(x) &= \phi(-x), && \text{all } -\infty < x < \infty, \\
\Phi(-x) &= 1 - \Phi(x), && \text{all } -\infty < x < \infty.
\end{aligned}
\tag{2.3.22}
$$

By a series expansion of $e^{-t^2/2}$ and direct integration, one can immediately derive the formula

$$\Phi(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\sum_{j=0}^{\infty}\frac{(-1)^j x^{2j+1}}{2^j\,j!\,(2j+1)}, \qquad -\infty < x < \infty. \tag{2.3.23}$$

The computation according to this formula is often inefficient. An excellent computing formula was given by Zelen and Severo (1968), namely

$$\Phi(x) = 1 - \phi(x)\left[b_1 t + b_2 t^2 + \cdots + b_5 t^5\right] + \epsilon(x), \qquad x \ge 0, \tag{2.3.24}$$

where $t = (1 + px)^{-1}$, $p = .2316419$; $b_1 = .3193815$; $b_2 = -.3565638$; $b_3 = 1.7814779$; $b_4 = -1.8212550$; $b_5 = 1.3302744$. The magnitude of the error term is $|\epsilon(x)| < 7.5 \cdot 10^{-8}$.
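Formula (2.3.24) translates almost directly into code. The sketch below is an illustration with names of our own choosing (`Phi_approx`, `Phi_erf`); it uses the seven-decimal coefficients quoted above, handles $x < 0$ through the symmetry relation (2.3.22), and compares the result with a reference value obtained from the error function in Python's standard library.

```python
from math import erf, exp, pi, sqrt

P = 0.2316419
B = (0.3193815, -0.3565638, 1.7814779, -1.8212550, 1.3302744)

def phi(x):
    """Standard normal p.d.f."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def Phi_approx(x):
    """Approximation (2.3.24); for x < 0 it uses Phi(-x) = 1 - Phi(x)."""
    if x < 0.0:
        return 1.0 - Phi_approx(-x)
    t = 1.0 / (1.0 + P * x)
    poly = sum(b * t**(k + 1) for k, b in enumerate(B))
    return 1.0 - phi(x) * poly

def Phi_erf(x):
    """Reference value of the standard normal c.d.f. via math.erf."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Largest discrepancy on a grid from -5 to 5 in steps of 0.01.
max_err = max(abs(Phi_approx(k / 100.0) - Phi_erf(k / 100.0))
              for k in range(-500, 501))
```

Because the coefficients are rounded to seven decimals, the observed discrepancy may be marginally larger than the stated bound on $\epsilon(x)$, but it should remain of the same order of magnitude.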


2.3.6 Normal Approximations 


The normal distribution can be used in certain cases to approximate well the cumulative probabilities of other distribution functions. Such approximations are very useful when it becomes too difficult to compute the exact cumulative probabilities of the distributions under consideration. For example, suppose $X \sim B(100, .35)$ and we have to compute the probability of the event $\{X \le 88\}$. This requires the computation of the sum of 89 terms in

$$\sum_{j=0}^{88}\binom{100}{j}(.35)^j(.65)^{100-j}.$$

Usually, such a numerical problem requires the use of some numerical approximation and/or the use of a computer. However, the cumulative probability $B(88; 100, .35)$ can be easily approximated by the normal c.d.f. This approximation is based on the celebrated Central Limit Theorem, which was discussed in Section 1.12. Accordingly, if $X \sim B(n, \theta)$ and $n$ is sufficiently large (relative to $\theta$) then, for $0 \le k_1 \le k_2 \le n$,

$$\sum_{k=k_1}^{k_2} b(k; n, \theta) \approx \Phi\left(\frac{k_2 + \frac{1}{2} - n\theta}{\sqrt{n\theta(1-\theta)}}\right) - \Phi\left(\frac{k_1 - \frac{1}{2} - n\theta}{\sqrt{n\theta(1-\theta)}}\right). \tag{2.3.25}$$

The symbol $\approx$ designates a large sample approximation.
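As a numerical illustration of (2.3.25), the sketch below compares the exact Binomial sum with its normal approximation for $n = 25$, $\theta = .5$, $k_1 = 0$, $k_2 = 12$. The helper names are ours, and the standard normal c.d.f. is computed from `math.erf` rather than from any formula in the text.

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, theta):
    """B(k; n, theta): exact sum of b(j; n, theta) for j = 0, ..., k."""
    return sum(comb(n, j) * theta**j * (1.0 - theta)**(n - j)
               for j in range(k + 1))

def Phi(x):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_approx(k1, k2, n, theta):
    """Right-hand side of (2.3.25), including both half-unit corrections."""
    s = sqrt(n * theta * (1.0 - theta))
    return Phi((k2 + 0.5 - n * theta) / s) - Phi((k1 - 0.5 - n * theta) / s)

n, theta = 25, 0.5
exact = binom_cdf(12, n, theta)
approx = normal_approx(0, 12, n, theta)
error = abs(exact - approx)
```

The half-unit shifts in `normal_approx` are the continuity correction built into (2.3.25); dropping them typically makes the approximation noticeably worse for small $n$.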

The maximal possible error in using this approximation is less than $.14[n\theta(1-\theta)]^{-1/2}$ (Johnson and Kotz, 1969, p. 64). The approximation turns out to be quite good, even if $n$ is not very large, if $\theta$ is close to $\theta_0 = .5$. In Table 2.1, we compare the numerically exact c.d.f. values of the Binomial distribution $B(k; n, \theta)$ with $n = 25$ (relatively small) and $\theta = .25, .40, .50$ to the approximation obtained from (2.3.25) with $k = k_2$ and $k_1 = 0$.

Table 2.1 Normal Approximation to the Binomial c.d.f., n = 25

            θ = .5                θ = .4                θ = .25
  k     Exact     Approx.     Exact     Approx.     Exact     Approx.
  0   0.000000   0.000001   0.000003   0.000053   0.000753   0.003956
  1   0.000001   0.000005   0.000047   0.000260   0.006271   0.014120
  2   0.000010   0.000032   0.000426   0.001100   0.031356   0.041632
  3   0.000078   0.000159   0.002364   0.003982   0.095462   0.102012
  4   0.000455   0.000687   0.009468   0.012372   0.212988   0.209462
  5   0.002039   0.002555   0.029359   0.033096   0.377526   0.364517
  6   0.007317   0.008198   0.073562   0.076521   0.560346   0.545964
  7   0.021643   0.022750   0.153549   0.153717   0.725754   0.718149
  8   0.053876   0.054799   0.273529   0.270146   0.849810   0.850651
  9   0.114761   0.115070   0.424614   0.419128   0.927919   0.933337
 10   0.212178   0.211855   0.585772   0.580872   0.969578   0.975176
 11   0.345019   0.344578   0.732279   0.729854   0.988513   0.992343
 12   0.500000   0.500000   0.846229   0.846283   0.995877   0.998054
 13   0.654981   0.655422   0.922196   0.923479   0.998332   0.999594
 14   0.787822   0.788145   0.965606   0.966904   0.999033   0.999931
 15   0.885238   0.884930   0.986828   0.987628   0.999204   0.999990
 16   0.946124   0.945201   0.995671   0.996018   0.999240   0.999999
 17   0.978357   0.977250   0.998792   0.998900   0.999246   1.000000
 18   0.992683   0.991802   0.999716   0.999740   0.999247   1.000000
 19   0.997961   0.997445   0.999944   0.999947   0.999247   1.000000
 20   0.999545   0.999313   0.999939   0.999991   0.999247   1.000000
 21   0.999922   0.999841   0.999996   0.999999   0.999247   1.000000
 22   0.999990   0.999968   0.999997   1.000000   0.999247   1.000000
 23   0.999999   0.999995   0.999997   1.000000   0.999247   1.000000
 24   1.000000   0.999999   0.999997   1.000000   0.999247   1.000000
 25   1.000000   1.000000   0.999997   1.000000   0.999247   1.000000

Considerable research has been done to improve the Normal approximation to the 
Binomial c.d.f. Some of the main results and references are provided in Johnson and 
Kotz (1969, p. 64). 

In a similar manner, the normal approximation can be applied to approximate the 
Hypergeometric c.d.f. (Johnson and Kotz, 1969, p. 148); the Poisson c.d.f. (Johnson 
and Kotz, 1969, p. 99) and the Negative-Binomial c.d.f. (Johnson and Kotz, 1969, 
p. 127). 

The normal distribution can also provide good approximations to the $G(\lambda, \nu)$ distributions, when $\nu$ is sufficiently large, and to other continuous distributions. For a summary of approximating formulae and references see Johnson and Kotz (1969) and Zelen and Severo (1968). In Table 2.2 we summarize important characteristics of the above distribution functions.


2.4 TRANSFORMATIONS


2.4.1 One-to-One Transformations of Several Variables 


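Let $X_1, \ldots, X_k$ be random variables of the continuous type with a joint p.d.f. $f(x_1, \ldots, x_k)$. Let $y_i = g_i(x_1, \ldots, x_k)$, $i = 1, \ldots, k$, be one-to-one transformations, and let $x_i = \psi_i(y_1, \ldots, y_k)$, $i = 1, \ldots, k$, be the inverse transformations. Assume that $\dfrac{\partial\psi_i}{\partial y_j}$ are continuous for all $i, j = 1, \ldots, k$ at all points $(y_1, \ldots, y_k)$. The Jacobian of the transformation is

$$J(y_1, \ldots, y_k) = \det\left(\frac{\partial\psi_i}{\partial y_j};\ i, j = 1, \ldots, k\right), \tag{2.4.1}$$

where $\det(\cdot)$ denotes the determinant of the matrix of partial derivatives. Then the joint p.d.f. of $(Y_1, \ldots, Y_k)$ is

$$h(y_1, \ldots, y_k) = f(\psi_1(\mathbf{y}), \ldots, \psi_k(\mathbf{y}))\,|J(\mathbf{y})|, \qquad \mathbf{y} = (y_1, \ldots, y_k). \tag{2.4.2}$$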


2.4.2 Distribution of Sums 


Let $X_1$, $X_2$ be absolutely continuous random variables with a joint p.d.f. $f(x_1, x_2)$. Consider the one-to-one transformation $Y_1 = X_1$, $Y_2 = X_1 + X_2$. It is easy to verify that $J(y_1, y_2) = 1$. Hence,

$$f_{Y_1, Y_2}(y_1, y_2) = f_{X_1, X_2}(y_1, y_2 - y_1).$$

Integrating over the range of $Y_1$ we obtain the marginal p.d.f. of $Y_2$, which is the required p.d.f. of the sum. Thus, if $g(y)$ denotes the p.d.f. of $Y_2$,

$$g(y) = \int_{-\infty}^{\infty} f(x, y - x)\,dx. \tag{2.4.3}$$

If $X_1$ and $X_2$ are independent, having marginal p.d.f.s $f_1(x)$ and $f_2(x)$, the p.d.f. of the sum $g(y)$ is the convolution of $f_1(x)$ and $f_2(x)$, i.e.,

$$g(y) = \int_{-\infty}^{\infty} f_1(x)f_2(y - x)\,dx. \tag{2.4.4}$$

If $X_1$ is discrete, the integral in (2.4.4) is replaced by a sum over the jump points of $F_1(x)$. If there are more than two variables, the distribution of the sum can be found by a similar method.
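The convolution formula (2.4.4) can be illustrated with two independent $E(\lambda)$ variables, whose sum is distributed like $G(\lambda, 2)$. The sketch below is only an illustration: the names `exp_pdf` and `sum_pdf`, the midpoint quadrature, and the values $\lambda = 1.5$, $y = 2$ are our own choices.

```python
from math import exp

def exp_pdf(x, lam):
    """Exponential density (2.3.12); zero for negative arguments."""
    return lam * exp(-lam * x) if x >= 0.0 else 0.0

def sum_pdf(y, f1, f2, lo, hi, steps=4000):
    """Midpoint-rule value of (2.4.4): integral of f1(x) f2(y - x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for m in range(steps):
        x = lo + (m + 0.5) * h
        total += f1(x) * f2(y - x)
    return h * total

lam, y = 1.5, 2.0
f = lambda x: exp_pdf(x, lam)
numeric = sum_pdf(y, f, f, 0.0, y)          # the integrand vanishes outside [0, y]
closed_form = lam**2 * y * exp(-lam * y)    # G(lam, 2) density from (2.3.11)
```

Restricting the integration range to $[0, y]$ reflects the supports of the two densities; for densities with unbounded support on both sides the range would have to be truncated with some care.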


2.4.3 Distribution of Ratios 


Let $X_1$, $X_2$ be absolutely continuous with a joint p.d.f. $f(x_1, x_2)$. We wish to derive the p.d.f. of $R = X_1/X_2$. In the general case, $X_2$ can be positive or negative and therefore we separate the two cases. Over the set $-\infty < x_1 < \infty$, $0 < x_2 < \infty$ the transformation $R = X_1/X_2$ and $Y = X_2$ is one-to-one. It is also the case over the set $-\infty < x_1 < \infty$, $-\infty < x_2 < 0$. The Jacobian of the inverse transformation is $J(y, r) = -y$. Hence, the p.d.f. of $R$ is

$$h(r) = -\int_{-\infty}^{0} y f(yr, y)\,dy + \int_{0}^{\infty} y f(yr, y)\,dy. \tag{2.4.5}$$

The result of Example 2.2 has important applications.

Let $X_1, X_2, \ldots, X_k$ be independent random variables having gamma distributions with equal $\lambda$, i.e., $X_i \sim G(\lambda, \nu_i)$, $i = 1, \ldots, k$. Let $T = \sum_{i=1}^{k} X_i$ and for $i = 1, \ldots, k - 1$

$$Y_i = X_i/T.$$

The marginal distribution of $Y_i$ is $B\left(\nu_i, \sum_{j=1}^{k}\nu_j - \nu_i\right)$. The joint distribution of $\mathbf{Y} = (Y_1, \ldots, Y_{k-1})$ is called the Dirichlet distribution, $D(\nu_1, \nu_2, \ldots, \nu_k)$, whose joint p.d.f. is

$$g(y_1, \ldots, y_{k-1}; \nu_1, \ldots, \nu_k) = \frac{\Gamma(\nu_1 + \cdots + \nu_k)}{\prod_{i=1}^{k}\Gamma(\nu_i)}\,\prod_{i=1}^{k-1} y_i^{\nu_i - 1}\left(1 - \sum_{i=1}^{k-1} y_i\right)^{\nu_k - 1}, \tag{2.4.6}$$

for $y_i \ge 0$, $\sum_{j=1}^{k-1} y_j \le 1$.

The p.d.f. of $D(\nu_1, \ldots, \nu_k)$ is a multivariate generalization of the beta distribution. Let $\nu^* = \sum_{j=1}^{k}\nu_j$. One can immediately prove that for all $i \ne i'$, $i, i' = 1, \ldots, k - 1$,

$$E\{Y_i Y_{i'}\} = \frac{\nu_i\nu_{i'}}{\nu^*(\nu^* + 1)}, \tag{2.4.7}$$

and thus

$$\mathrm{cov}(Y_i, Y_{i'}) = -\frac{\nu_i\nu_{i'}}{\nu^{*2}(\nu^* + 1)}. \tag{2.4.8}$$

Additional properties of the Dirichlet distributions are specified in the exercises.


2.5 VARIANCES AND COVARIANCES OF SAMPLE MOMENTS


A random sample is a set of $n$ ($n \ge 1$) independent and identically distributed (i.i.d.) random variables, having a common distribution $F(x)$. We assume that $F$ has all moments required in the following development. The $r$th moment of $F$, $r \ge 1$, is $\mu_r$.

The $r$th sample moment is

$$\hat\mu_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r. \tag{2.5.1}$$

We immediately obtain that

$$E\{\hat\mu_r\} = \frac{1}{n}\sum_{i=1}^{n} E\{X_i^r\} = \mu_r, \qquad r \ge 1, \tag{2.5.2}$$

since all $X_i$ are identically distributed. Notice that due to independence, $\mathrm{cov}(X_i, X_j) = 0$ for all $i \ne j$. We present here a method for computing $V\{\hat\mu_r\}$ and $\mathrm{cov}(\hat\mu_r, \hat\mu_{r'})$ for $r \ne r'$. We consider expansions of the form $\left(\sum_{i=1}^{n} X_i^{l}\right)^k$, $l, k \ge 1$, in terms of augmented symmetric functions and introduce the following notation

$$(l) = \sum_{i=1}^{n} X_i^{l}, \tag{2.5.3}$$

$$[l_1 l_2] = \mathop{\sum\sum}_{i \ne j} X_i^{l_1}X_j^{l_2}, \tag{2.5.4}$$

$$[l_1 l_2 l_3] = \mathop{\sum\sum\sum}_{i \ne j \ne k} X_i^{l_1}X_j^{l_2}X_k^{l_3}, \tag{2.5.5}$$

etc. The sum of powers in such an expression is called the weight of $[\ ]$. Thus, the weight of $[l_1 l_2 l_3]$ is $w = l_1 + l_2 + l_3$. In Table 2.3, we find expansions of $(l_1)(l_2)\cdots$ in terms of multi-sums $[l_1 l_2 \cdots]$. For additional values of coefficients for such expansions, see David and Kendall (1955). For example, to expand $\left(\sum_{i=1}^{n} X_i^3\right)\left(\sum_{i=1}^{n} X_i\right)^2$ the weight is $w = 5$, and according to Table 2.3,

$$(3)(1)^2 = [5] + 2[41] + [32] + [31^2].$$



Table 2.3 Augmented Symmetric Functions in Terms of Power-Series

Weight   ( )          [ ]
2        (2)          [2]
         (1)^2        [2] + [1^2]
3        (3)          [3]
         (2)(1)       [3] + [21]
         (1)^3        [3] + 3[21] + [1^3]
4        (4)          [4]
         (3)(1)       [4] + [31]
         (2)^2        [4] + [2^2]
         (2)(1)^2     [4] + 2[31] + [2^2] + [21^2]
         (1)^4        [4] + 4[31] + 3[2^2] + 6[21^2] + [1^4]
5        (5)          [5]
         (4)(1)       [5] + [41]
         (3)(2)       [5] + [32]
         (3)(1)^2     [5] + 2[41] + [32] + [31^2]
         (2)^2(1)     [5] + [41] + 2[32] + [2^2 1]
         (2)(1)^3     [5] + 3[41] + 4[32] + 3[31^2] + 3[2^2 1] + [21^3]
         (1)^5        [5] + 5[41] + 10[32] + 10[31^2] + 15[2^2 1] + 10[21^3] + [1^5]
6        (6)          [6]
         (5)(1)       [6] + [51]
         (4)(2)       [6] + [42]
         (4)(1)^2     [6] + 2[51] + [42] + [41^2]
         (3)^2        [6] + [3^2]
         (3)(2)(1)    [6] + [51] + [42] + [3^2] + [321]
         (3)(1)^3     [6] + 3[51] + 3[42] + 3[41^2] + [3^2] + 3[321] + [31^3]
         (2)^3        [6] + 3[42] + [2^3]
         (2)^2(1)^2   [6] + 2[51] + 3[42] + [41^2] + 2[3^2] + 4[321] + [2^3] + [2^2 1^2]
         (2)(1)^4     [6] + 4[51] + 7[42] + 6[41^2] + 4[3^2] + 16[321]
                      + 4[31^3] + 3[2^3] + 6[2^2 1^2] + [21^4]
         (1)^6        [6] + 6[51] + 15[42] + 15[41^2] + 10[3^2] + 60[321]
                      + 20[31^3] + 15[2^3] + 45[2^2 1^2] + 15[21^4] + [1^6]

(*) [3^2] = [33], etc.
Source: Compiled from David and Kendall (1955).

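Thus,

$$
\left(\sum_{i=1}^{n} X_i^3\right)\left(\sum_{i=1}^{n} X_i\right)^2
= \sum_{i=1}^{n} X_i^5 + 2\mathop{\sum\sum}_{i \ne j} X_i^4 X_j
+ \mathop{\sum\sum}_{i \ne j} X_i^3 X_j^2
+ \mathop{\sum\sum\sum}_{i \ne j \ne k} X_i^3 X_j X_k.
$$

The expected values of such expansions are given in terms of product of the moments (independence) times the number of terms in the sum, e.g.,

$$E\left\{\mathop{\sum\sum}_{i \ne j} X_i^3 X_j^2\right\} = n(n-1)\mu_3\mu_2.$$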
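The expansion of $(3)(1)^2$ is an algebraic identity, so it can be checked by brute force on any small vector. The sketch below uses hypothetical helper names (`power_sum` for $(l)$ and `augmented` for $[l_1 l_2 \cdots]$) and integer data so that the comparison is exact.

```python
from itertools import permutations

def power_sum(x, l):
    """(l): sum of x_i ** l over all i."""
    return sum(v**l for v in x)

def augmented(x, powers):
    """[l1 l2 ...]: sum over ordered tuples of distinct indices."""
    total = 0
    for idx in permutations(range(len(x)), len(powers)):
        term = 1
        for i, l in zip(idx, powers):
            term *= x[i]**l
        total += term
    return total

x = [2, -1, 3, 5]   # arbitrary integers, so both sides are exact
lhs = power_sum(x, 3) * power_sum(x, 1)**2
rhs = (augmented(x, [5]) + 2 * augmented(x, [4, 1])
       + augmented(x, [3, 2]) + augmented(x, [3, 1, 1]))

# Number of terms in [32] for n = 4, which is n(n - 1) = 12.
n_terms_32 = len(list(permutations(range(len(x)), 2)))
```

The count `n_terms_32` is the factor $n(n-1)$ that multiplies $\mu_3\mu_2$ in the expected value above; longer multi-sums contribute $n(n-1)(n-2)$ terms, and so on.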


2.6 DISCRETE MULTIVARIATE DISTRIBUTIONS 


2.6.1 The Multinomial Distribution 


Consider an experiment in which the result of each trial belongs to one of $k$ alternative categories. Let $\boldsymbol{\theta}' = (\theta_1, \ldots, \theta_k)$ be a probability vector, i.e., $0 \le \theta_i \le 1$ for all $i = 1, \ldots, k$ and $\sum_{i=1}^{k}\theta_i = 1$. $\theta_i$ designates the probability that the outcome of an individual trial belongs to the $i$th category. Consider $n$ such independent trials, $n \ge 1$, and let $\mathbf{X} = (X_1, \ldots, X_k)$ be a random vector. $X_i$ is the number of trials in which the $i$th category is realized, $\sum_{i=1}^{k} X_i = n$. The distribution of $\mathbf{X}$ is given by the multinomial probability distribution

$$p(j_1, \ldots, j_k; n, \boldsymbol{\theta}) = \frac{n!}{\prod_{i=1}^{k} j_i!}\prod_{i=1}^{k}\theta_i^{j_i}, \tag{2.6.1}$$

where $j_i = 0, 1, \ldots, n$ and $\sum_{i=1}^{k} j_i = n$. These terms are obtained by the multinomial expansion of $(\theta_1 + \cdots + \theta_k)^n$. Hence, their sum equals 1. We will designate the multinomial distribution based on $n$ trials and probability vector $\boldsymbol{\theta}$ by $M(n, \boldsymbol{\theta})$. The binomial distribution is a special case, when $k = 2$. Moreover, the marginal distribution of $X_i$ is the binomial $B(n, \theta_i)$. The joint marginal distribution of any pair $(X_i, X_{i'})$ where $1 \le i < i' \le k$ is the corresponding trinomial, with probability distribution function

$$p(j_i, j_{i'}; n, \theta_i, \theta_{i'}) = \frac{n!}{j_i!\,j_{i'}!\,(n - j_i - j_{i'})!}\,\theta_i^{j_i}\theta_{i'}^{j_{i'}}(1 - \theta_i - \theta_{i'})^{n - \nu}, \tag{2.6.2}$$

where $\nu = j_i + j_{i'}$.

We consider now the moments of the multinomial distribution. From the marginal Binomial distribution of the $X$s we have

$$
\begin{aligned}
E\{X_i\} &= n\theta_i, && i = 1, \ldots, k, \\
V\{X_i\} &= n\theta_i(1 - \theta_i), && i = 1, \ldots, k.
\end{aligned}
\tag{2.6.3}
$$

To obtain the covariance of $X_i$, $X_j$, $i \ne j$, we proceed in the following manner. If $n = 1$ then $E\{X_iX_j\} = 0$ for all $i \ne j$, since only one of the components of $\mathbf{X}$ is one and all the others are zero. Hence, $E\{X_iX_j\} - E\{X_i\}E\{X_j\} = -\theta_i\theta_j$ if $i \ne j$. If $n > 1$, we obtain the result by considering the sum of $n$ independent vectors. Thus,

$$\mathrm{cov}(X_i, X_j) = -n\theta_i\theta_j, \qquad \text{all } i \ne j. \tag{2.6.4}$$
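The moment formulas (2.6.3) and (2.6.4) can be confirmed by exact enumeration for a small case. The sketch below is an illustration with function names of our own; it takes $k = 3$, $n = 6$ and $\boldsymbol{\theta} = (.2, .3, .5)$, all arbitrary, and sums over every outcome of the trinomial.

```python
from math import factorial

def multinomial_pmf(j, theta):
    """p(j_1, ..., j_k; n, theta) as in (2.6.1), with n = sum(j)."""
    coef = factorial(sum(j))
    prob = 1.0
    for ji, ti in zip(j, theta):
        coef //= factorial(ji)   # stays an integer at every step
        prob *= ti**ji
    return coef * prob

def trinomial_moments(n, theta):
    """Exact E{X1}, V{X1}, cov(X1, X2) for M(n, theta) with k = 3."""
    e1 = e2 = e11 = e12 = 0.0
    for j1 in range(n + 1):
        for j2 in range(n + 1 - j1):
            p = multinomial_pmf((j1, j2, n - j1 - j2), theta)
            e1 += j1 * p
            e2 += j2 * p
            e11 += j1 * j1 * p
            e12 += j1 * j2 * p
    return e1, e11 - e1 * e1, e12 - e1 * e2

n, theta = 6, (0.2, 0.3, 0.5)
mean1, var1, cov12 = trinomial_moments(n, theta)
```

With these inputs the three returned values should agree with $n\theta_1$, $n\theta_1(1-\theta_1)$ and $-n\theta_1\theta_2$ up to rounding.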


We conclude the section with a remark about the joint moment generating function (m.g.f.) of the multinomial random vector $\mathbf{X}$. This function is defined in the following manner. Since $X_k = n - \sum_{i=1}^{k-1} X_i$, we define for every $k \ge 2$

$$M(t_1, \ldots, t_{k-1}) = E\left\{\exp\left\{\sum_{i=1}^{k-1} t_iX_i\right\}\right\}. \tag{2.6.5}$$

One can prove by induction on $k$ that

$$M(t_1, \ldots, t_{k-1}) = \left[\sum_{i=1}^{k-1}\theta_i e^{t_i} + \left(1 - \sum_{i=1}^{k-1}\theta_i\right)\right]^n. \tag{2.6.6}$$


2.6.2 Multivariate Negative Binomial 


Let X = (X,,..., X,) be a k-dimensional random vector. Each random variable, X;, 
i=1,...,k, can assume only nonnegative integers. Their joint probability distribu- 


tion function is given by 
k 
r ( + >i] 


k Hak 
aCjts +5 es, v) = — (: ay [[e. (67) 


rw] [rGit+ D 


i=1 


where ji,..-, Jj, =0,1,...5 O< v < =, 0< 86; <1 for each i=1,...,k and 
k 


5-6; < 1. We develop here the basic theory for the case of k = 2. (For k = | the 
i=l 

distribution reduces to the univariate NB(@, v). Summing first with respect to j2 we 
obtain 


Pv + fC — 0 — "6" 
(1 — ATO + D° 


CO 
Y— gins j23 01, %, v) = (2.6.8) 


p=0 
Hence, the marginal of $X_1$ is

$$P\{X_1 = j_1\} = nb\left(j_1; \frac{\theta_1}{1 - \theta_2}, \nu\right), \quad j_1 = 0, 1, \ldots \tag{2.6.9}$$

where $nb(j; \psi, \nu)$ is the p.d.f. of the negative binomial NB(ψ, ν). By dividing the
joint probability distribution function $g(j_1, j_2; \theta_1, \theta_2, \nu)$ by
$nb\left(j_1; \frac{\theta_1}{1 - \theta_2}, \nu\right)$, we obtain that the conditional distribution of $X_2$ given $X_1$
is the negative binomial $NB(\theta_2, \nu + X_1)$. Accordingly, if $NB(\theta_1, \theta_2, \nu)$ designates a
bivariate negative binomial with parameters $(\theta_1, \theta_2, \nu)$, then the expected value of
$X_i$ is given by

$$E\{X_i\} = \nu\theta_i / (1 - \theta_1 - \theta_2), \quad i = 1, 2. \tag{2.6.10}$$

The variance of the marginal distribution is

$$V\{X_1\} = \nu\theta_1 (1 - \theta_2) / (1 - \theta_1 - \theta_2)^2. \tag{2.6.11}$$

Finally, to obtain the covariance between $X_1$ and $X_2$ we determine first

$$E\{X_1 X_2\} = E\{X_1 E\{X_2 \mid X_1\}\} = \frac{\theta_2}{1 - \theta_2} E\{X_1(\nu + X_1)\} = \nu(\nu + 1) \frac{\theta_1 \theta_2}{(1 - \theta_1 - \theta_2)^2}. \tag{2.6.12}$$

Therefore,

$$\operatorname{cov}(X_1, X_2) = \frac{\nu\theta_1\theta_2}{(1 - \theta_1 - \theta_2)^2}. \tag{2.6.13}$$


We notice that, contrary to the multinomial case, the covariances of any two compo- 
nents of the multivariate negative binomial vector are all positive. 
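
As a numerical illustration (not part of the original derivation), the following minimal sketch sums the joint p.d.f. (2.6.7) for k = 2 over a truncated grid and compares the resulting mean and covariance with (2.6.10) and (2.6.13). The parameter values and the truncation point `cutoff` are arbitrary assumptions chosen for the example.

```python
import math

def bivariate_nb_pmf(j1, j2, theta1, theta2, nu):
    """Joint p.d.f. (2.6.7) for k = 2, evaluated on the log scale."""
    log_p = (math.lgamma(nu + j1 + j2) - math.lgamma(nu)
             - math.lgamma(j1 + 1) - math.lgamma(j2 + 1)
             + nu * math.log(1.0 - theta1 - theta2)
             + j1 * math.log(theta1) + j2 * math.log(theta2))
    return math.exp(log_p)

def moments(theta1, theta2, nu, cutoff=200):
    """Truncated sums for E{X1}, E{X2} and cov(X1, X2)."""
    e1 = e2 = e12 = 0.0
    for j1 in range(cutoff):
        for j2 in range(cutoff):
            p = bivariate_nb_pmf(j1, j2, theta1, theta2, nu)
            e1 += j1 * p
            e2 += j2 * p
            e12 += j1 * j2 * p
    return e1, e2, e12 - e1 * e2

# Illustrative (assumed) parameter values.
theta1, theta2, nu = 0.2, 0.3, 2.5
e1, e2, cov = moments(theta1, theta2, nu)
d = 1.0 - theta1 - theta2
print(e1, nu * theta1 / d)                  # mean vs (2.6.10)
print(cov, nu * theta1 * theta2 / d ** 2)   # covariance vs (2.6.13)
```

The covariance comes out positive, in line with the remark above.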


2.6.3 Multivariate Hypergeometric Distributions 


This family of k-variate distributions is derived by a straightforward generalization
of the univariate model. Accordingly, suppose that a finite population of N elements
contains $M_1$ of type 1, $M_2$ of type 2, ..., $M_k$ of type k and $N - \sum_{i=1}^{k} M_i$ of other
types. A sample of n elements is drawn at random and without replacement from this
population. Let $X_i$, $i = 1, \ldots, k$, denote the number of elements of type i observed
in the sample. The p.d.f. of $X = (X_1, \ldots, X_k)$ is

$$p(j_1, \ldots, j_k; N, M_1, \ldots, M_k, n) = \frac{\prod_{i=1}^{k} \binom{M_i}{j_i} \binom{N - \sum_{i=1}^{k} M_i}{n - \sum_{i=1}^{k} j_i}}{\binom{N}{n}}. \tag{2.6.14}$$

One immediately obtains that the marginal distributions of the components of
X are hypergeometric distributions, with parameters $(N, M_i, n)$, $i = 1, \ldots, k$. If we
designate by $H(N, M_1, \ldots, M_k, n)$ the multivariate hypergeometric distribution, then
the conditional distribution of $(X_{r+1}, \ldots, X_k)$ given $(X_1 = j_1, \ldots, X_r = j_r)$ is the
hypergeometric $H\left(N - \sum_{i=1}^{r} M_i, M_{r+1}, \ldots, M_k, n - \sum_{i=1}^{r} j_i\right)$. Using this result and
the law of the iterated expectation we obtain the following result, for all $i \neq j$,

$$\operatorname{cov}(X_i, X_j) = -n\left(1 - \frac{n - 1}{N - 1}\right) \frac{M_i}{N} \cdot \frac{M_j}{N}. \tag{2.6.15}$$

This result is similar to that of the multinomial (2.6.4), which corresponds to sampling 
with replacement. 
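
The covariance formula (2.6.15) can be checked by exact enumeration on a small population. The sketch below is illustrative only: the population sizes are arbitrary assumptions, and the p.d.f. is coded in the standard multivariate hypergeometric form with one "other types" category.

```python
from fractions import Fraction
from math import comb

def mv_hypergeom_pmf(x, M, N, n):
    """P{X_1 = x_1, ..., X_k = x_k} for the model of Section 2.6.3."""
    rest = n - sum(x)                      # sampled elements of "other" types
    if rest < 0 or rest > N - sum(M):
        return Fraction(0)
    num = comb(N - sum(M), rest)
    for xi, Mi in zip(x, M):
        num *= comb(Mi, xi)
    return Fraction(num, comb(N, n))

def covariance_12(M, N, n):
    """Exact cov(X_1, X_2) by summing over all outcomes (k = 2)."""
    e1 = e2 = e12 = Fraction(0)
    for x1 in range(M[0] + 1):
        for x2 in range(M[1] + 1):
            p = mv_hypergeom_pmf((x1, x2), M, N, n)
            e1 += x1 * p
            e2 += x2 * p
            e12 += x1 * x2 * p
    return e12 - e1 * e2

# Assumed toy population: N = 10, M_1 = 3, M_2 = 4, sample size n = 5.
N, M, n = 10, (3, 4), 5
cov = covariance_12(M, N, n)
formula = -Fraction(n) * (1 - Fraction(n - 1, N - 1)) * Fraction(M[0], N) * Fraction(M[1], N)
print(cov, formula)
```

Exact rational arithmetic is used so that the two quantities can be compared without rounding error.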


2.7 MULTINORMAL DISTRIBUTIONS 


2.7.1 Basic Theory 


A random vector $(X_1, \ldots, X_k)$ of the continuous type has a k-variate multinormal
distribution if its joint p.d.f. can be expressed in vector and matrix notation as

$$f(x_1, \ldots, x_k) = \frac{1}{(2\pi)^{k/2} |V|^{1/2}} \exp\left\{-\frac{1}{2} (x - \xi)' V^{-1} (x - \xi)\right\} \tag{2.7.1}$$

for $-\infty < \xi_i < \infty$, $i = 1, \ldots, k$. Here, $x = (x_1, \ldots, x_k)'$, $\xi = (\xi_1, \ldots, \xi_k)'$. V is
a k × k symmetric positive definite matrix and |V| is the determinant of V. We
introduce the notation $X \sim N(\xi, V)$. We notice that the k-variate multinormal p.d.f.
(2.7.1) is symmetric about the point ξ. Hence, ξ is the expected value (mean) vector
of X. Moreover, all the moments of X exist.
The m.g.f. of X is

$$M(t_1, \ldots, t_k) = E\{\exp(t'X)\} = \exp\left(\frac{1}{2} t'Vt + t'\xi\right). \tag{2.7.2}$$

To establish formula (2.7.2) we can assume, without loss of generality, that ξ = 0,
since if $M_X(t)$ is the m.g.f. of X and Y = X + b, then the m.g.f. of Y is $M_Y(t) =
\exp(t'b) M_X(t)$. Thus, we have to determine

$$M(t) = \frac{1}{(2\pi)^{k/2} |V|^{1/2}} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2} x'V^{-1}x + t'x\right) \prod_{i=1}^{k} dx_i. \tag{2.7.3}$$

Since V is positive definite, there exists a nonsingular matrix D such that V =
DD'. Consider the transformation $Y = D^{-1}X$; then $x'V^{-1}x = y'y$ and $t'x = t'Dy$.
Therefore,

$$-\frac{1}{2} x'V^{-1}x + t'x = -\frac{1}{2} (y - D't)'(y - D't) + \frac{1}{2} t'Vt. \tag{2.7.4}$$

Finally, the Jacobian of the transformation is |D| and

$$M(t) = \exp\left(\frac{1}{2} t'Vt\right) \frac{|D|}{(2\pi)^{k/2} |V|^{1/2}} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2} (y - D't)'(y - D't)\right) \prod_{i=1}^{k} dy_i. \tag{2.7.5}$$

Since $|D| = |V|^{1/2}$ and $(2\pi)^{-k/2}$ times the multiple integral on the right-hand side
is equal to one, we establish (2.7.2). In order to determine the variance-covariance
matrix of X we can assume, without loss of generality, that its expected value is zero.
Accordingly, for all i, j,

$$\operatorname{cov}(X_i, X_j) = \left.\frac{\partial^2}{\partial t_i \partial t_j} M(t)\right|_{t=0}. \tag{2.7.6}$$

From (2.7.2) and (2.7.6), we obtain that $\operatorname{cov}(X_i, X_j) = \sigma_{ij}$ $(i, j = 1, \ldots, k)$, where
$\sigma_{ij}$ is the (i, j)th element of V. Thus, V is the variance-covariance matrix of X.
A k-variate multinormal distribution is called standard if $\xi_i = 0$ and $\sigma_{ii} = 1$ for all
$i = 1, \ldots, k$. In this case, the variance matrix will be denoted by R since its elements
are the correlations between the components of X. A standard normal vector is often
denoted by Z, its joint p.d.f. and c.d.f. by $\phi_k(z \mid R)$ and $\Phi_k(z \mid R)$, respectively.
2.7.2 Distribution of Subvectors and Distributions of Linear Forms

In this section we present several basic results without proofs. The proofs are
straightforward and the reader is referred to Anderson (1958) and Graybill (1961).

Suppose that a k-dimensional vector X has a multinormal distribution $N(\xi, V)$. We
consider the two subvectors Y and Z, i.e., X' = (Y', Z'), where Y is r-dimensional,
$1 \leq r < k$.

Partition correspondingly the expectation vector ξ to $\xi' = (\eta', \zeta')$ and the
covariance matrix to

$$V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}.$$

The following results are fundamental to the multinormal theory.
(i) $Y \sim N(\eta, V_{11})$
(ii) $Z \sim N(\zeta, V_{22})$
(iii) $Y \mid Z \sim N(\eta + V_{12} V_{22}^{-1} (Z - \zeta),\; V_{11} - V_{12} V_{22}^{-1} V_{21})$,

and an analogous formula can be obtained for the conditional distribution of Z
given Y.
The conditional expectation

$$E\{Y \mid Z\} = \eta + V_{12} V_{22}^{-1} (Z - \zeta) \tag{2.7.7}$$

is called the linear regression of Y on Z. The conditional covariance matrix

$$V(Y \mid Z) = V_{11} - V_{12} V_{22}^{-1} V_{21} \tag{2.7.8}$$


represents the variances and covariances of the components of Y around the linear
regression hyperplane. The above results have the following converse counterpart.
Suppose that Y and Z are two vectors such that

(i) $Y \mid Z \sim N(AZ, V)$
and
(ii) $Z \sim N(\zeta, D)$;
then the marginal distribution of Y is the multinormal

$$Y \sim N(A\zeta, V + ADA')$$

and the joint distribution of Y and Z is the multinormal, with expectation vector
$(\zeta'A', \zeta')'$ and a covariance matrix

$$\begin{pmatrix} V + ADA' & AD \\ DA' & D \end{pmatrix}.$$

Finally, if $X \sim N(\xi, V)$ and Y = b + AX, then $Y \sim N(b + A\xi, AVA')$. That is,
every linear combination of normally distributed random variables is normally
distributed.

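In the case of k = 2, the multinormal distribution is called a bivariate normal
distribution. The joint p.d.f. of a bivariate normal distribution is

$$f(x, y; \xi, \eta, \sigma_1, \sigma_2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)} \left[\left(\frac{x - \xi}{\sigma_1}\right)^2 - 2\rho\, \frac{x - \xi}{\sigma_1} \cdot \frac{y - \eta}{\sigma_2} + \left(\frac{y - \eta}{\sigma_2}\right)^2\right]\right\}, \tag{2.7.9}$$

$-\infty < x, y < \infty$.
The parameters ξ and η are the expectations, and $\sigma_1^2$ and $\sigma_2^2$ are the variances of
X and Y, respectively. ρ is the coefficient of correlation.
The conditional distribution of Y given {X = x} is normal with conditional
expectation

$$E\{Y \mid x\} = \eta + \beta(x - \xi), \tag{2.7.10}$$

where $\beta = \rho\sigma_2/\sigma_1$. The conditional variance is

$$\sigma^2_{y \mid x} = \sigma_2^2 (1 - \rho^2). \tag{2.7.11}$$

These formulae are special cases of (2.7.7) and (2.7.8). Since the joint p.d.f. of (X, Y)
can be written as the product of the conditional p.d.f. of Y given X, with the marginal
p.d.f. of X, we obtain the expression

$$f(x, y; \xi, \eta, \sigma_1, \sigma_2, \rho) = \frac{1}{\sigma_1} \phi\left(\frac{x - \xi}{\sigma_1}\right) \frac{1}{\sigma_2\sqrt{1 - \rho^2}} \phi\left(\frac{y - \eta - \beta(x - \xi)}{\sigma_2\sqrt{1 - \rho^2}}\right). \tag{2.7.12}$$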
This expression can serve also as a basis for an algorithm to compute the
bivariate normal c.d.f., i.e.,

$$P\{X \leq x_0, Y \leq y_0\} = \int_{-\infty}^{(x_0 - \xi)/\sigma_1} \phi(u)\, \Phi\left(\frac{y_0 - \eta - \rho\sigma_2 u}{\sigma_2\sqrt{1 - \rho^2}}\right) du. \tag{2.7.13}$$
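
One possible realization of such an algorithm is sketched below. It is only an illustration: the integral in (2.7.13) is approximated by a composite Simpson rule, the lower limit is truncated at an assumed value of −9, and the function names are hypothetical. For a standard bivariate normal the orthant probability $P\{X \leq 0, Y \leq 0\} = \frac{1}{4} + \frac{\arcsin \rho}{2\pi}$ is a known closed form and serves as a check.

```python
import math

def phi(u):
    """Standard normal p.d.f."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def bvn_cdf(x0, y0, xi=0.0, eta=0.0, s1=1.0, s2=1.0, rho=0.0,
            lower=-9.0, panels=2000):
    """Approximate P{X <= x0, Y <= y0} by Simpson's rule applied to (2.7.13).

    `lower` truncates the integration range and `panels` (even) sets the grid;
    both are tuning assumptions, not part of the formula.
    """
    upper = (x0 - xi) / s1
    if upper <= lower:
        return 0.0
    h = (upper - lower) / panels

    def g(u):
        return phi(u) * Phi((y0 - eta - rho * s2 * u) / (s2 * math.sqrt(1.0 - rho * rho)))

    total = g(lower) + g(upper)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * g(lower + i * h)
    return total * h / 3.0

p_indep = bvn_cdf(0.0, 0.0, rho=0.0)   # independence: should be close to 1/4
p_corr = bvn_cdf(0.0, 0.0, rho=0.5)    # compare with 1/4 + arcsin(0.5)/(2*pi)
print(p_indep, p_corr, 0.25 + math.asin(0.5) / (2.0 * math.pi))
```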


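Let $Z_1$, $Z_2$ and $Z_3$ have a joint standard trivariate normal distribution, with a
correlation matrix

$$R = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{pmatrix}.$$

The conditional bivariate normal distribution of $(Z_1, Z_2)$ given $Z_3$ has a covariance
matrix

$$V(Z_1, Z_2 \mid Z_3) = \begin{pmatrix} 1 - \rho_{13}^2 & \rho_{12} - \rho_{13}\rho_{23} \\ \rho_{12} - \rho_{13}\rho_{23} & 1 - \rho_{23}^2 \end{pmatrix}. \tag{2.7.14}$$

The conditional correlation between $Z_1$ and $Z_2$, given $Z_3$, can be determined from
(2.7.14). It is called the partial correlation of $Z_1$, $Z_2$ under $Z_3$ and is given by

$$\rho_{12 \cdot 3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}}. \tag{2.7.15}$$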
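
The partial correlation can also be obtained directly from the general conditional covariance formula (2.7.8). The short sketch below, with arbitrarily chosen correlations, computes $V_{11} - V_{12} V_{22}^{-1} V_{21}$ for the trivariate case (where $V_{22}$ is the scalar 1) and compares the implied correlation with (2.7.15). The helper names are hypothetical.

```python
import math

def conditional_cov(R):
    """V11 - V12 V22^{-1} V21 for (Z1, Z2) given Z3; V22 = R[2][2] is a scalar."""
    v22 = R[2][2]
    return [[R[i][j] - R[i][2] * R[2][j] / v22 for j in range(2)] for i in range(2)]

def partial_corr(r12, r13, r23):
    """Formula (2.7.15)."""
    return (r12 - r13 * r23) / math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))

# Assumed correlations for illustration.
r12, r13, r23 = 0.5, 0.3, 0.4
R = [[1.0, r12, r13],
     [r12, 1.0, r23],
     [r13, r23, 1.0]]
C = conditional_cov(R)
from_schur = C[0][1] / math.sqrt(C[0][0] * C[1][1])
print(from_schur, partial_corr(r12, r13, r23))
```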
2.7.3 Independence of Linear Forms 
Let $X = (X_1, \ldots, X_k)'$ be a multinormal random vector. Without loss of generality,
assume that E{X} = 0. Let V be the covariance matrix of X. We investigate first the
conditions under which two linear functions $Y_1 = \alpha'X$ and $Y_2 = \beta'X$ are independent.
Let $Y = (Y_1, Y_2)'$, $A = \begin{pmatrix} \alpha' \\ \beta' \end{pmatrix}$. That is, A is a 2 × k matrix and Y = AX. Y has
a bivariate normal distribution with a covariance matrix AVA'. $Y_1$ and $Y_2$ are
independent if and only if $\operatorname{cov}(Y_1, Y_2) = 0$. Moreover, $\operatorname{cov}(Y_1, Y_2) = \alpha'V\beta$. Since V is
positive definite there exists a nonsingular matrix C such that V = CC'. Accordingly,
$\operatorname{cov}(Y_1, Y_2) = 0$ if and only if $(C'\alpha)'(C'\beta) = 0$. This means that the vectors $C'\alpha$ and
$C'\beta$ should be orthogonal. This condition is generalized in a similar fashion to cases
where $Y_1$ and $Y_2$ are vectors. Accordingly, if $Y_1 = AX$ and $Y_2 = BX$, then $Y_1$ and
$Y_2$ are independent if and only if AVB' = 0. In other words, the column vectors of
C'A' should be mutually orthogonal to the column vectors of C'B'.
2.8 DISTRIBUTIONS OF SYMMETRIC QUADRATIC FORMS OF
NORMAL VARIABLES 


In this section, we study the distributions of symmetric quadratic forms in normal 
random variables. We start from the simplest case. 


Case A: 


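$$X \sim N(0, \sigma^2), \quad Q = X^2.$$

Assume first that $\sigma^2 = 1$. The density of X is then $\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right)$. Therefore,
the p.d.f. of Q is

$$f_Q(y) = \frac{1}{2\sqrt{y}} \left[\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} (\sqrt{y})^2\right) + \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} (-\sqrt{y})^2\right)\right] = \frac{1}{2^{1/2}\, \Gamma\left(\frac{1}{2}\right)}\, y^{-1/2} e^{-y/2}, \quad 0 < y < \infty, \tag{2.8.1}$$

since $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$.

Comparing $f_Q(y)$ with the p.d.f. of the gamma distributions, we conclude that
if $\sigma^2 = 1$ then $Q \sim G\left(\frac{1}{2}, \frac{1}{2}\right) \sim \chi^2[1]$. In the more general case of arbitrary $\sigma^2$,
$Q \sim \sigma^2 \chi^2[1]$.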
Case B: 


$$X \sim N(\xi, \sigma^2), \quad Q = X^2.$$

This is a more complicated situation. We shall prove that the p.d.f. of Q (and so its
c.d.f. and m.g.f.) is, at each point, the expected value of the p.d.f. (or c.d.f. or m.g.f.)
of $\sigma^2 \chi^2[1 + 2J]$, where J is a Poisson random variable with mean

$$\lambda = \frac{\xi^2}{2\sigma^2}. \tag{2.8.2}$$

Such an expectation of distributions is called a mixture. The distribution of Q when
$\sigma^2 = 1$ is called a noncentral chi-squared with 1 degree of freedom and parameter
of noncentrality λ. In symbols, $Q \sim \chi^2[1; \lambda]$. When λ = 0, the noncentral chi-squared
coincides with the chi-squared, which is also called central chi-squared. The proof
is obtained by determining first the m.g.f. of Q. As before, assume that $\sigma^2 = 1$. Then,

$$M_Q(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{t x^2 - \frac{1}{2} (x - \xi)^2\right\} dx. \tag{2.8.3}$$
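Write, for all $t < \frac{1}{2}$,

$$t x^2 - \frac{1}{2} (x - \xi)^2 = -\frac{1 - 2t}{2} \left(x - \frac{\xi}{1 - 2t}\right)^2 + \frac{\xi^2 t}{1 - 2t}. \tag{2.8.4}$$

Thus,

$$M_Q(t) = \exp\left\{\frac{\xi^2 t}{1 - 2t}\right\} (1 - 2t)^{-1/2}. \tag{2.8.5}$$

Furthermore, $t/(1 - 2t) = -\frac{1}{2} + \frac{1}{2} (1 - 2t)^{-1}$. Hence, with $\lambda = \xi^2/2$,

$$M_Q(t) = e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} (1 - 2t)^{-\left(\frac{1}{2} + j\right)}. \tag{2.8.6}$$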
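
The equality of the closed form (2.8.5) and the Poisson mixture series (2.8.6) can be confirmed numerically. The following sketch, with arbitrarily chosen values of t and ξ and a truncated series, is meant only as an illustration of the identity.

```python
import math

def mgf_closed(t, xi):
    """Closed form (2.8.5), valid for t < 1/2 (sigma^2 = 1)."""
    return math.exp(xi * xi * t / (1.0 - 2.0 * t)) / math.sqrt(1.0 - 2.0 * t)

def mgf_mixture(t, xi, terms=200):
    """Truncated Poisson mixture series (2.8.6) with lambda = xi^2 / 2."""
    lam = 0.5 * xi * xi
    total = 0.0
    weight = math.exp(-lam)               # Poisson probability P{J = 0}
    for j in range(terms):
        total += weight * (1.0 - 2.0 * t) ** (-(0.5 + j))
        weight *= lam / (j + 1)           # recursion to P{J = j + 1}
    return total

# Assumed illustrative values.
t, xi = 0.1, 1.3
print(mgf_closed(t, xi), mgf_mixture(t, xi))
```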


According to Table 2.2, $(1 - 2t)^{-\left(\frac{1}{2} + j\right)}$ is the m.g.f. of $\chi^2[1 + 2j]$. Thus, according
to (2.8.6) the m.g.f. of $\chi^2[1; \lambda]$ is the mixture of the m.g.f.s of $\chi^2[1 + 2J]$, where J
has a Poisson distribution, with mean λ as in (2.8.2). This implies that the distribution
of $\chi^2[1; \lambda]$ is the marginal distribution of X in a model where (X, J) have a joint
distribution, such that the conditional distribution of X given {J = j} is like that of
$\chi^2[1 + 2j]$ and the marginal distribution of J is Poisson with expectation λ. From
Table 2.2, we obtain that $E\{\chi^2[\nu]\} = \nu$ and $V\{\chi^2[\nu]\} = 2\nu$. Hence, by the laws of
the iterated expectation and total variance

$$E\{\chi^2[1; \lambda]\} = 1 + 2\lambda \tag{2.8.7}$$

and

$$V\{\chi^2[1; \lambda]\} = 2(1 + 4\lambda). \tag{2.8.8}$$

Case C:

$$X_1, \ldots, X_n \text{ are independent}; \quad X_i \sim N(\xi_i, \sigma^2), \; i = 1, \ldots, n; \quad Q = \sum_{i=1}^{n} X_i^2.$$

It is required that all the variances $\sigma^2$ are the same. As proven in Case B,

$$X_i^2 \sim \sigma^2 \chi^2[1; \lambda_i] \sim \sigma^2 \chi^2[1 + 2J_i], \quad i = 1, \ldots, n, \tag{2.8.9}$$

where $J_i \sim P(\lambda_i)$.
Consider first the conditional distribution of Q given $(J_1, \ldots, J_n)$. From the result
on the sum of independent chi-squared random variables, we infer

$$Q \mid (J_1, \ldots, J_n) \sim \sigma^2 \chi^2\left[n + 2\sum_{i=1}^{n} J_i\right], \tag{2.8.10}$$

where $Q \mid (J_1, \ldots, J_n)$ denotes the conditional equivalence of the random variables.
Furthermore, since the original $X_i$s are independent, so are the $J_i$s and therefore

$$J_1 + \cdots + J_n \sim P(\lambda_1 + \cdots + \lambda_n). \tag{2.8.11}$$

Hence, the marginal distribution of Q is the mixture of $\sigma^2 \chi^2[n + 2M]$ where
$M \sim P(\lambda_1 + \cdots + \lambda_n)$. We have thus proven that

$$Q \sim \sigma^2 \chi^2[n; \lambda_1 + \cdots + \lambda_n]. \tag{2.8.12}$$
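
A small Monte Carlo experiment can illustrate result (2.8.12). The sketch below is an assumption-laden illustration: the means, the seed, and the number of draws are arbitrary, and the theoretical moments $\sigma^2(n + 2\lambda)$ and $2\sigma^4(n + 4\lambda)$, with $\lambda = \lambda_1 + \cdots + \lambda_n$ and $\lambda_i = \xi_i^2/(2\sigma^2)$, are obtained by the same iterated expectation and total variance argument as in Case B, applied to $\sigma^2 \chi^2[n + 2M]$.

```python
import random

def simulate_q(xi, sigma, draws, seed=12345):
    """Draw Q = sum of X_i^2 with independent X_i ~ N(xi_i, sigma^2)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(m, sigma) ** 2 for m in xi) for _ in range(draws)]

# Assumed illustrative means and common standard deviation.
xi, sigma = (1.0, 0.5, -2.0), 1.0
n = len(xi)
lam = sum(m * m for m in xi) / (2.0 * sigma ** 2)   # lambda_1 + ... + lambda_n

q = simulate_q(xi, sigma, draws=100_000)
mean_q = sum(q) / len(q)
var_q = sum((v - mean_q) ** 2 for v in q) / (len(q) - 1)

mean_theory = sigma ** 2 * (n + 2.0 * lam)          # E of sigma^2 chi^2[n; lam]
var_theory = 2.0 * sigma ** 4 * (n + 4.0 * lam)     # V of sigma^2 chi^2[n; lam]
print(mean_q, mean_theory)
print(var_q, var_theory)
```

The simulated moments should agree with the theoretical ones up to Monte Carlo error.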
Case D: 
$$X \sim N(\xi, V) \quad \text{and} \quad Q = X'AX,$$

where A is a real symmetric matrix. The following is an important result.

$$Q \sim \chi^2[r; \lambda], \quad \text{with } \lambda = \frac{1}{2} \xi'A\xi, \tag{2.8.13}$$

if and only if VA is an idempotent matrix of rank r (Graybill, 1961). The proof is
based on the fact that every positive definite matrix V can be expressed as V = CC',
where C is nonsingular. If $Y = C^{-1}X$ then $Y \sim N(C^{-1}\xi, I)$ and X'AX = Y'C'ACY.
C'AC is idempotent if and only if VA is idempotent.

The following are important facts about real symmetric idempotent matrices. 


(i) A is idempotent if $A^2 = A$.
(ii) All eigenvalues of A are either 1 or 0.
(iii) Rank(A) = tr.{A}, where $\mathrm{tr.}\{A\} = \sum_{i=1}^{n} A_{ii}$ is the sum of the diagonal
elements of A.
(iv) The only nonsingular idempotent matrix is the identity matrix I.
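
These facts are easy to verify on the centering matrix $A = I - \frac{1}{n} J$ (J being the n × n matrix of ones), which appears later in Section 2.13. The sketch below is a minimal illustration with an arbitrarily chosen n; it checks $A^2 = A$ and that the trace equals n − 1, which by fact (iii) is the rank.

```python
def centering_matrix(n):
    """A = I - (1/n) J, where J is the n x n matrix of ones."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Plain triple-loop matrix product for small square matrices."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

n = 5
A = centering_matrix(n)
A2 = matmul(A, A)
max_dev = max(abs(A2[i][j] - A[i][j]) for i in range(n) for j in range(n))
trace = sum(A[i][i] for i in range(n))
print(max_dev)   # close to zero: A is idempotent
print(trace)     # close to n - 1, the rank of A by fact (iii)
```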


2.9 INDEPENDENCE OF LINEAR AND QUADRATIC FORMS
OF NORMAL VARIABLES

Without loss of generality, we assume that $X \sim N(0, I)$. Indeed, if $X \sim N(0, V)$ and
V = CC', make the transformation $X^* = C^{-1}X$; then $X^* \sim N(0, I)$. Let Y = BX
and Q = X'AX, where A is idempotent of rank r, $1 \leq r \leq k$. B is an n × k matrix
of full rank, $1 \leq n \leq k$.


Theorem 2.9.1. Y and Q are independent if and only if

$$BA = 0. \tag{2.9.1}$$

For proof, see Graybill (1961, Ch. 4).
Suppose now that we have m quadratic forms $X'B_iX$ in a multinormal vector
$X \sim N(\xi, I)$.

Theorem 2.9.2. If $X \sim N(\xi, I)$ the set of positive semidefinite quadratic forms
$X'B_iX$ $(i = 1, \ldots, m)$ are jointly independent and $X'B_iX \sim \chi^2[r_i; \lambda_i]$, where $r_i$ is
the rank of $B_i$ and $\lambda_i = \frac{1}{2} \xi'B_i\xi$, if any two of the following three conditions are
satisfied.

1. Each $B_i$ is idempotent $(i = 1, \ldots, m)$;
2. $\sum_{j=1}^{m} B_j$ is idempotent;
3. $B_i B_j = 0$ for all $i \neq j$.


This theorem has many applications in the theory of regression analysis, as will 
be shown later. 


2.10 THE ORDER STATISTICS 


Let $X_1, \ldots, X_n$ be a set of random variables (having a joint distribution). The order
statistic is

$$S(X_1, \ldots, X_n) = (X_{(1)}, X_{(2)}, \ldots, X_{(n)}), \tag{2.10.1}$$

where $X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$.

If $X_1, \ldots, X_n$ are independent random variables having an identical absolutely
continuous distribution function F(x) with p.d.f. f(x), then the p.d.f. of the order
statistic is

$$f(x_{(1)}, \ldots, x_{(n)}) = n! \prod_{i=1}^{n} f(x_{(i)}). \tag{2.10.2}$$
To obtain the p.d.f. of the ith order statistic $X_{(i)}$, $i = 1, \ldots, n$, we can integrate
(2.10.2) over the set

$$S_i(\xi) = \{-\infty < x_{(1)} \leq \cdots \leq x_{(i-1)} \leq \xi \leq x_{(i+1)} \leq \cdots \leq x_{(n)} < \infty\}. \tag{2.10.3}$$

This integration yields the p.d.f.

$$f_{(i)}(\xi) = \frac{n!}{(i - 1)!\,(n - i)!}\, f(\xi)\, (F(\xi))^{i-1} (1 - F(\xi))^{n-i}, \tag{2.10.4}$$

$-\infty < \xi < \infty$. We can obtain this result also by a nice probabilistic argument.
Indeed, for all dx sufficiently small, the trinomial model yields

$$P\{\xi - dx < X_{(i)} \leq \xi + dx\} = \frac{n!}{(i - 1)!\,(n - i)!}\, F^{i-1}(\xi - dx)\, [F(\xi + dx) - F(\xi - dx)]\, [1 - F(\xi + dx)]^{n-i} + o(dx), \tag{2.10.5}$$

where o(dx) is a function of dx that approaches zero at a faster rate than dx, i.e.,
$o(dx)/dx \to 0$ as $dx \to 0$.

Dividing (2.10.5) by 2dx and taking the limit as $dx \to 0$, we obtain (2.10.4). The
joint p.d.f. of $(X_{(i)}, X_{(j)})$ with $1 \leq i < j \leq n$ is obtained similarly as

$$f_{(i),(j)}(x, y) = \frac{n!}{(i - 1)!\,(j - 1 - i)!\,(n - j)!}\, f(x) f(y)\, [F(x)]^{i-1} [F(y) - F(x)]^{j-i-1} [1 - F(y)]^{n-j}, \quad -\infty < x < y < \infty. \tag{2.10.6}$$
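
As a quick check of formula (2.10.4), the sketch below evaluates the density of the ith order statistic for a standard exponential parent distribution (an assumption made only for the example). For i = 1 the minimum of n independent standard exponentials is exponential with rate n, so the density should equal $n e^{-nx}$.

```python
import math

def order_stat_pdf(x, i, n, f, F):
    """Density (2.10.4) of the i-th order statistic in a sample of size n."""
    c = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    return c * f(x) * F(x) ** (i - 1) * (1.0 - F(x)) ** (n - i)

# Assumed parent distribution: standard exponential.
def f(x):
    return math.exp(-x) if x >= 0 else 0.0

def F(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

n, x = 5, 0.7
pdf_min = order_stat_pdf(x, 1, n, f, F)   # compare with n * exp(-n x)
pdf_max = order_stat_pdf(x, n, n, f, F)   # compare with n f(x) F(x)^(n-1)
print(pdf_min, n * math.exp(-n * x))
print(pdf_max, n * f(x) * F(x) ** (n - 1))
```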


In a similar fashion we can write the joint p.d.f. of any set of order statistics. From the 

joint p.d.f.s of order statistics we can derive the distribution of various functions of 

the order statistics. In particular, consider the sample median and the sample range. 
The sample median is defined as

$$M_e = \begin{cases} (X_{(m)} + X_{(m+1)})/2, & \text{if } n = 2m, \\ X_{(m+1)}, & \text{if } n = 2m + 1. \end{cases} \tag{2.10.7}$$

That is, half of the sample values are smaller than the median and half of them are
greater. The sample range $R_n$ is defined as

$$R_n = X_{(n)} - X_{(1)}. \tag{2.10.8}$$
In the case of absolutely continuous independent r.v.s, having a common density
f(x), the density g(x) of the sample median is

$$g(x) = \begin{cases} \dfrac{(2m + 1)!}{(m!)^2}\, f(x)\, F^m(x)\, [1 - F(x)]^m, & \text{if } n = 2m + 1, \\[2ex] \dfrac{2(2m)!}{[(m - 1)!]^2} \displaystyle\int_{-\infty}^{x} f(u) f(2x - u)\, F^{m-1}(u)\, [1 - F(2x - u)]^{m-1} du, & \text{if } n = 2m. \end{cases} \tag{2.10.9}$$

We derive now the distribution of the sample range $R_n$. Starting with the joint p.d.f.
of $(X_{(1)}, X_{(n)})$

$$f(x, y) = n(n - 1)\, f(x) f(y)\, [F(y) - F(x)]^{n-2}, \quad x \leq y, \tag{2.10.10}$$

we make the transformation u = x, r = y − x.
The Jacobian of this transformation is J = 1 and the joint density of (u, r) is

$$g(u, r) = n(n - 1)\, f(u) f(u + r)\, [F(u + r) - F(u)]^{n-2}. \tag{2.10.11}$$

Accordingly, the density of $R_n$ is

$$h(r) = n(n - 1) \int_{-\infty}^{\infty} f(u) f(u + r)\, [F(u + r) - F(u)]^{n-2} du. \tag{2.10.12}$$


For a comprehensive development of the theory of order statistics and interesting 
applications, see the books of David (1970) and Gumbel (1958). 
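
To illustrate (2.10.12), the sketch below evaluates the range density numerically for a standard exponential parent, an assumption made only for this example. In that case the integral can be done in closed form, giving $h(r) = (n - 1) e^{-r} (1 - e^{-r})^{n-2}$, which the numerical value should reproduce. The integration starts at 0 because the exponential density vanishes for negative arguments, and the upper limit and grid size are assumed tuning constants.

```python
import math

def simpson(g, a, b, panels=2000):
    """Composite Simpson rule; panels must be even."""
    h = (b - a) / panels
    total = g(a) + g(b)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def range_pdf(r, n, f, F, lower=0.0, upper=60.0, panels=20000):
    """Numerical version of (2.10.12) on a truncated interval [lower, upper]."""
    def integrand(u):
        return f(u) * f(u + r) * (F(u + r) - F(u)) ** (n - 2)
    return n * (n - 1) * simpson(integrand, lower, upper, panels)

# Assumed parent distribution: standard exponential.
f = lambda x: math.exp(-x)
F = lambda x: 1.0 - math.exp(-x)

n, r = 4, 0.8
numeric = range_pdf(r, n, f, F)
closed = (n - 1) * math.exp(-r) * (1.0 - math.exp(-r)) ** (n - 2)
print(numeric, closed)
```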


2.11 t-DISTRIBUTIONS 


In many problems of statistical inference, one considers the distribution of the ratio
of a statistic, which is normally distributed, to its standard error (the square root of
its variance). Such ratios have distributions called the t-distributions. More
specifically, let $U \sim N(0, 1)$ and $W \sim (\chi^2[\nu]/\nu)^{1/2}$, where U and W are independent. The
distribution of U/W is called the "Student's t-distribution." We denote this statistic
by t[ν] and say that U/W is distributed as a (central) t[ν] with ν degrees of freedom.

An example for the application of this distribution is the following. Let $X_1, \ldots, X_n$
be i.i.d. from a $N(\xi, \sigma^2)$ distribution. We have proven that the sample mean $\bar{X}$
is distributed as $N\left(\xi, \frac{\sigma^2}{n}\right)$ and is independent of the sample variance $S^2$, where
$S^2 \sim \sigma^2 \chi^2[n - 1]/(n - 1)$. Hence,

$$\frac{\sqrt{n}\,(\bar{X} - \xi)}{S} \sim \frac{N(0, 1)}{(\chi^2[n - 1]/(n - 1))^{1/2}} \sim t[n - 1]. \tag{2.11.1}$$
To find the moments of t[ν] we observe that, since the numerator and denominator
are independent,

$$E\{(t[\nu])^r\} = E\{U^r\} \cdot E\{(\chi^2[\nu]/\nu)^{-r/2}\}. \tag{2.11.2}$$

Thus, all the existing odd moments of t[ν] are equal to zero, since $E\{U^r\} = 0$
for all r = 2m + 1. The existence of $E\{(t[\nu])^r\}$ depends on the existence of
$E\{(\chi^2[\nu]/\nu)^{-r/2}\}$. We have

$$E\{(\chi^2[\nu]/\nu)^{-r/2}\} = \left(\frac{\nu}{2}\right)^{r/2} \Gamma\left(\frac{\nu}{2} - \frac{r}{2}\right) \Big/ \Gamma\left(\frac{\nu}{2}\right). \tag{2.11.3}$$

Accordingly, a necessary and sufficient condition for the existence of $E\{(t[\nu])^r\}$ is
ν > r. Thus, if ν > 2 we obtain that

$$E\{t^2[\nu]\} = \nu/(\nu - 2). \tag{2.11.4}$$

This is also the variance of t[ν]. We notice that $V\{t[\nu]\} \to 1$ as $\nu \to \infty$. It is not
difficult to derive the p.d.f. of t[ν], which is

$$f(t; \nu) = \frac{1}{\sqrt{\nu}\, B\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu + 1)/2}, \quad -\infty < t < \infty. \tag{2.11.5}$$

The c.d.f. of t[ν] can be expressed in terms of the incomplete beta function. Due to
the symmetry of the distribution around the origin

$$P\{t[\nu] \leq t\} = 1 - P\{t[\nu] \leq -t\}, \quad t < 0. \tag{2.11.6}$$
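
The density (2.11.5) and the variance formula (2.11.4) can be cross-checked numerically. The sketch below integrates the density by a composite Simpson rule on a wide but finite interval; the choice ν = 10, the interval [−200, 200], and the grid size are assumptions made for illustration.

```python
import math

def t_pdf(t, nu):
    """Density (2.11.5); B(1/2, nu/2) is written with gamma functions."""
    beta_half = math.gamma(0.5) * math.gamma(nu / 2.0) / math.gamma((nu + 1.0) / 2.0)
    return (1.0 + t * t / nu) ** (-(nu + 1.0) / 2.0) / (math.sqrt(nu) * beta_half)

def simpson(g, a, b, panels=2000):
    """Composite Simpson rule; panels must be even."""
    h = (b - a) / panels
    total = g(a) + g(b)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

nu = 10
mass = simpson(lambda t: t_pdf(t, nu), -200.0, 200.0, 40000)
second_moment = simpson(lambda t: t * t * t_pdf(t, nu), -200.0, 200.0, 40000)
print(mass)                           # close to 1
print(second_moment, nu / (nu - 2))   # compare with (2.11.4)
```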


We consider now the distribution of (U + ξ)/W, where ξ is any real number.
This ratio is called the noncentral t with ν degrees of freedom, and parameter of
noncentrality ξ. This variable is the ratio of two independent random variables, namely
N(ξ, 1) to $(\chi^2[\nu]/\nu)^{1/2}$. If we denote the noncentral t by t[ν; ξ], then

$$t[\nu; \xi] \sim (N(0, 1) + \xi)/(\chi^2[\nu]/\nu)^{1/2}. \tag{2.11.7}$$

Since the random variables in the numerator and denominator of (2.11.7) are
independent, one obtains

$$E\{t[\nu; \xi]\} = \xi \left(\frac{\nu}{2}\right)^{1/2} \frac{\Gamma\left(\frac{\nu - 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}, \quad \nu > 1, \tag{2.11.8}$$

and that the central moments of orders 2 and 3 are

$$\mu_2 = V\{t[\nu; \xi]\} = \frac{\nu}{\nu - 2} (1 + \xi^2) - \xi^2\, \frac{\nu}{2} \left(\frac{\Gamma\left(\frac{\nu - 1}{2}\right)}{\Gamma(\nu/2)}\right)^2, \quad \nu > 2, \tag{2.11.9}$$

and

$$\mu_3 = \xi \left(\frac{\nu}{2}\right)^{1/2} \frac{\Gamma\left(\frac{\nu - 1}{2}\right)}{\Gamma(\nu/2)} \left[\frac{\nu(2\nu - 3 + \xi^2)}{(\nu - 2)(\nu - 3)} - 2\mu_2\right], \quad \nu > 3. \tag{2.11.10}$$

This shows that the t[ν; ξ] is not symmetric. Furthermore, since $U + \xi \sim -U + \xi$
we obtain that, for all $-\infty < \xi < \infty$,

$$P\{t[\nu; \xi] \geq t\} = P\{t[\nu; -\xi] \leq -t\}. \tag{2.11.11}$$

In particular, we have seen this in the central case (ξ = 0). The formulae of the p.d.f.
and the c.d.f. of the noncentral t[ν; ξ] are quite complicated. There exists a variety
of formulae for numerical computations. We shall not present these formulae here;
the interested reader is referred to Johnson and Kotz (1969, Ch. 31). In the following
section, we provide a representation of these distributions in terms of mixtures of
beta distributions.

The univariate t-distribution can be generalized to a multivariate-t in a variety of
ways. Consider an m-dimensional random vector X having a multinormal distribution
$N(\xi, \sigma^2 R)$, where R is a correlation matrix. This is the case when all components of
X have the same variance $\sigma^2$. Recall that the marginal distribution of

$$Y_i = (X_i - \xi_i) \sim N(0, \sigma^2), \quad i = 1, \ldots, m.$$

Thus, if $S^2 \sim \sigma^2 \chi^2[\nu]/\nu$ independently of $Y_1, \ldots, Y_m$, then

$$t_i = \frac{X_i - \xi_i}{S}, \quad i = 1, \ldots, m,$$

have the marginal t-distributions t[ν]. The p.d.f. of the multivariate distribution of
$t = \frac{1}{S} Y$ is given by

$$f(t_1, \ldots, t_m) = \frac{\Gamma\left(\frac{1}{2}(\nu + m)\right)}{(\pi\nu)^{m/2}\, \Gamma\left(\frac{\nu}{2}\right) |R|^{1/2}} \left(1 + \frac{1}{\nu}\, t'R^{-1}t\right)^{-\frac{\nu + m}{2}}. \tag{2.11.12}$$

Generally, we say that X has a $t[\nu; \xi, \Sigma]$ distribution if its multivariate p.d.f. is

$$f(x_1, \ldots, x_m; \xi, \Sigma) \propto \left(1 + \frac{1}{\nu} (x - \xi)' \Sigma^{-1} (x - \xi)\right)^{-\frac{\nu + m}{2}}. \tag{2.11.13}$$


This distribution has applications in Bayesian analysis, as shown in Chapter 8. 


2.12 F-DISTRIBUTIONS 


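The F-distributions are obtained by considering the distributions of ratios of two
independent variance estimators based on normally distributed random variables. As
such, these distributions have various important applications, especially in the
analysis of variance and regression (Section 4.6). We introduce now the F-distributions
formally. Let $\chi^2[\nu_1]$ and $\chi^2[\nu_2]$ be two independent chi-squared random variables
with $\nu_1$ and $\nu_2$ degrees of freedom, respectively. The ratio

$$F[\nu_1, \nu_2] \sim \frac{\chi^2[\nu_1]/\nu_1}{\chi^2[\nu_2]/\nu_2} \tag{2.12.1}$$

is called an F-random variable with $\nu_1$ and $\nu_2$ degrees of freedom. It is a
straightforward matter to derive the p.d.f. of $F[\nu_1, \nu_2]$, which is given by

$$f(x; \nu_1, \nu_2) = \frac{(\nu_1/\nu_2)^{\nu_1/2}}{B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \cdot \frac{x^{\nu_1/2 - 1}}{\left(1 + \frac{\nu_1}{\nu_2} x\right)^{(\nu_1 + \nu_2)/2}}. \tag{2.12.2}$$

The cumulative distribution function can be computed by means of the incomplete
beta function ratio according to the following formula

$$P\{F[\nu_1, \nu_2] \leq \xi\} = I_{R(\xi)}\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right), \tag{2.12.3}$$

where

$$R(\xi) = \frac{\nu_1}{\nu_2}\, \xi \Big/ \left(1 + \frac{\nu_1}{\nu_2}\, \xi\right). \tag{2.12.4}$$

In order to derive this formula, we recall that if $G\left(1, \frac{\nu_1}{2}\right)$ and $G\left(1, \frac{\nu_2}{2}\right)$ are two
independent gamma random variables, then (see Example 2.2)

$$G\left(1, \frac{\nu_1}{2}\right) \Big/ \left[G\left(1, \frac{\nu_1}{2}\right) + G\left(1, \frac{\nu_2}{2}\right)\right] \sim \beta\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right). \tag{2.12.5}$$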
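Hence,

$$G\left(1, \frac{\nu_1}{2}\right) \Big/ G\left(1, \frac{\nu_2}{2}\right) \sim \frac{\beta\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)}{1 - \beta\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)}. \tag{2.12.6}$$

We thus obtain

$$P\{F[\nu_1, \nu_2] \leq \xi\} = P\left\{\frac{\beta\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)}{1 - \beta\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \leq \frac{\nu_1}{\nu_2}\, \xi\right\} = P\left\{\beta\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right) \leq R(\xi)\right\} = I_{R(\xi)}\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right). \tag{2.12.7}$$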
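
The relationship (2.12.3) can be checked numerically by integrating the density (2.12.2) directly and comparing with the incomplete beta function ratio at R(ξ). The sketch below uses a simple Simpson rule for both integrals; it assumes $\nu_1 \geq 2$ and $\nu_2 \geq 2$ so that neither integrand is singular at the endpoints, and the degrees of freedom and the point ξ are arbitrary choices.

```python
import math

def simpson(g, a, b, panels=2000):
    """Composite Simpson rule; panels must be even."""
    h = (b - a) / panels
    total = g(a) + g(b)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def beta_fn(p, q):
    """Complete beta function B(p, q)."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)

def inc_beta_ratio(x, p, q):
    """I_x(p, q) by numerical integration (assumes p >= 1 and q >= 1)."""
    return simpson(lambda u: u ** (p - 1) * (1.0 - u) ** (q - 1), 0.0, x) / beta_fn(p, q)

def f_pdf(x, v1, v2):
    """Density (2.12.2) of F[v1, v2]."""
    c = (v1 / v2) ** (v1 / 2.0) / beta_fn(v1 / 2.0, v2 / 2.0)
    return c * x ** (v1 / 2.0 - 1.0) / (1.0 + v1 * x / v2) ** ((v1 + v2) / 2.0)

def f_cdf_via_beta(xi, v1, v2):
    """Right-hand side of (2.12.3) with R(xi) from (2.12.4)."""
    w = v1 * xi / v2
    return inc_beta_ratio(w / (1.0 + w), v1 / 2.0, v2 / 2.0)

v1, v2, xi = 4, 6, 1.5
direct = simpson(lambda x: f_pdf(x, v1, v2), 0.0, xi)
via_beta = f_cdf_via_beta(xi, v1, v2)
print(direct, via_beta)
```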


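For testing statistical hypotheses, especially for the analysis of variance and
regression, one needs quantiles of the $F[\nu_1, \nu_2]$ distribution. These quantiles are denoted
by $F_\gamma[\nu_1, \nu_2]$ and are tabulated in various statistical tables. It is easy to establish
the following relationship between the quantiles of $F[\nu_1, \nu_2]$ and those of $F[\nu_2, \nu_1]$,
namely,

$$F_\gamma[\nu_1, \nu_2] = 1/F_{1-\gamma}[\nu_2, \nu_1], \quad 0 < \gamma < 1. \tag{2.12.8}$$

The quantiles of the $F[\nu_1, \nu_2]$ distribution can also be determined by those of the beta
distribution by employing formula (2.12.5). If we denote by $\beta_\gamma(p, q)$ the values of x
for which $I_x(p, q) = \gamma$, we obtain from (2.12.4) that

$$F_\gamma[\nu_1, \nu_2] = \frac{\nu_2}{\nu_1}\, \beta_\gamma\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right) \Big/ \left[1 - \beta_\gamma\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)\right]. \tag{2.12.9}$$

The moments of $F[\nu_1, \nu_2]$ are obtained in the following manner. For a positive
integer r

$$E\{(F[\nu_1, \nu_2])^r\} = \left(\frac{\nu_2}{\nu_1}\right)^r \frac{\Gamma\left(\frac{\nu_1}{2} + r\right) \Gamma\left(\frac{\nu_2}{2} - r\right)}{\Gamma\left(\frac{\nu_1}{2}\right) \Gamma\left(\frac{\nu_2}{2}\right)}. \tag{2.12.10}$$

We realize that the rth moment of $F[\nu_1, \nu_2]$ exists if and only if $\nu_2 > 2r$. In
particular,

$$E\{F[\nu_1, \nu_2]\} = \nu_2/(\nu_2 - 2). \tag{2.12.11}$$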
Similarly, if $\nu_2 > 4$ then

$$V\{F[\nu_1, \nu_2]\} = \frac{2\nu_2^2 (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)}. \tag{2.12.12}$$


On various occasions one may be interested in an F-like statistic, in which the ratio
consists of a noncentral chi-squared in the numerator. In this case the statistic is
called a noncentral F. More specifically, let $\chi^2[\nu_1; \lambda]$ be a noncentral chi-squared
with $\nu_1$ degrees of freedom and a parameter of noncentrality λ. Let $\chi^2[\nu_2]$ be a central
chi-squared with $\nu_2$ degrees of freedom, independent of the noncentral chi-squared.
Then

$$F[\nu_1, \nu_2; \lambda] \sim \frac{\chi^2[\nu_1; \lambda]/\nu_1}{\chi^2[\nu_2]/\nu_2} \tag{2.12.13}$$

is called a noncentral $F[\nu_1, \nu_2; \lambda]$ statistic. We have proven earlier that $\chi^2[\nu_1; \lambda] \sim
\chi^2[\nu_1 + 2J]$, where J has a Poisson distribution with expected value λ. For this
reason, we can represent the noncentral $F[\nu_1, \nu_2; \lambda]$ as a mixture of central F
statistics.

$$F[\nu_1, \nu_2; \lambda] \sim \frac{\nu_1 + 2J}{\nu_1} \cdot \frac{\chi^2[\nu_1 + 2J]/(\nu_1 + 2J)}{\chi^2[\nu_2]/\nu_2} \sim \frac{\nu_1 + 2J}{\nu_1}\, F[\nu_1 + 2J, \nu_2], \tag{2.12.14}$$

where $J \sim P(\lambda)$. Various results concerning the c.d.f. of $F[\nu_1, \nu_2; \lambda]$, its moments,
etc., can be obtained from relationship (2.12.14). The c.d.f. of the noncentral F
statistic is

$$P\{F[\nu_1, \nu_2; \lambda] \leq \xi\} = e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!}\, P\left\{F[\nu_1 + 2j, \nu_2] \leq \frac{\nu_1 \xi}{\nu_1 + 2j}\right\}. \tag{2.12.15}$$

Furthermore, following (2.12.3) we obtain

$$P\{F[\nu_1, \nu_2; \lambda] \leq \xi\} = e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!}\, I_{R(\xi)}\left(\frac{\nu_1}{2} + j, \frac{\nu_2}{2}\right),$$

where

$$R(\xi) = \frac{\nu_1}{\nu_2}\, \xi \Big/ \left(1 + \frac{\nu_1}{\nu_2}\, \xi\right).$$
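As in the central case, the moments of the noncentral F are obtained by employing
the law of the iterated expectation and (2.12.14). Thus,

$$E\{F[\nu_1, \nu_2; \lambda]\} = E\left\{\frac{\nu_1 + 2J}{\nu_1}\, F[\nu_1 + 2J, \nu_2]\right\}. \tag{2.12.16}$$

However, for all $j = 0, 1, \ldots$, $E\{F[\nu_1 + 2j, \nu_2]\} = \nu_2/(\nu_2 - 2)$. Hence,

$$E\{F[\nu_1, \nu_2; \lambda]\} = \nu_2 (\nu_1 + 2\lambda)/(\nu_1 (\nu_2 - 2)),$$

and, by (2.12.12),

$$V\left\{\frac{\nu_1 + 2j}{\nu_1}\, F[\nu_1 + 2j, \nu_2]\right\} = \frac{(\nu_1 + 2j)^2}{\nu_1^2} \cdot \frac{2\nu_2^2}{(\nu_2 - 2)^2 (\nu_2 - 4)} \cdot \frac{\nu_1 + \nu_2 + 2j - 2}{\nu_1 + 2j}. \tag{2.12.17}$$

Hence, applying the law of the total variance,

$$V\{F[\nu_1, \nu_2; \lambda]\} = \frac{2\nu_2^2 \left[(\nu_1 + 2\lambda)(\nu_1 + 2\lambda + \nu_2 - 2) + 4\lambda\right]}{\nu_1^2 (\nu_2 - 2)^2 (\nu_2 - 4)} + \frac{4\lambda\nu_2^2}{\nu_1^2 (\nu_2 - 2)^2}. \tag{2.12.18}$$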
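
The law of total variance step can be made concrete with a short computation. The sketch below forms the mean and variance of the noncentral F directly from the Poisson mixture (2.12.14), using the central moments (2.12.11) and (2.12.12), and compares them with a closed form written as the sum of the expected conditional variance and the variance of the conditional mean. The parameter values and the series truncation are assumptions for illustration, and the function names are hypothetical.

```python
import math

def central_f_mean(v2):
    """E{F[v1, v2]} from (2.12.11); it does not depend on v1."""
    return v2 / (v2 - 2.0)

def central_f_var(v1, v2):
    """V{F[v1, v2]} from (2.12.12)."""
    return 2.0 * v2 ** 2 * (v1 + v2 - 2.0) / (v1 * (v2 - 2.0) ** 2 * (v2 - 4.0))

def noncentral_f_moments(v1, v2, lam, terms=300):
    """Mean and variance of F[v1, v2; lam] by summing over J ~ Poisson(lam)."""
    w = math.exp(-lam)                     # P{J = 0}
    m1 = m2 = 0.0
    for j in range(terms):
        scale = (v1 + 2.0 * j) / v1
        cond_mean = scale * central_f_mean(v2)
        cond_var = scale ** 2 * central_f_var(v1 + 2.0 * j, v2)
        m1 += w * cond_mean
        m2 += w * (cond_var + cond_mean ** 2)
        w *= lam / (j + 1)                 # recursion to P{J = j + 1}
    return m1, m2 - m1 ** 2

def closed_form_var(v1, v2, lam):
    """E{V(F | J)} + V{E(F | J)} written out in closed form."""
    a = v1 + 2.0 * lam
    first = 2.0 * v2 ** 2 * (a * (a + v2 - 2.0) + 4.0 * lam) / (
        v1 ** 2 * (v2 - 2.0) ** 2 * (v2 - 4.0))
    second = 4.0 * lam * v2 ** 2 / (v1 ** 2 * (v2 - 2.0) ** 2)
    return first + second

v1, v2, lam = 3, 10, 1.5
mean_mix, var_mix = noncentral_f_moments(v1, v2, lam)
print(mean_mix, v2 * (v1 + 2 * lam) / (v1 * (v2 - 2)))
print(var_mix, closed_form_var(v1, v2, lam))
```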


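We conclude the section with the following observation on the relationship between
the t- and the F-distributions. According to the definition of t[ν] we immediately obtain
that

$$t^2[\nu] \sim N^2(0, 1)/(\chi^2[\nu]/\nu) \sim F[1, \nu]. \tag{2.12.19}$$

Hence,

$$P\{-t \leq t[\nu] \leq t\} = P\{F[1, \nu] \leq t^2\} = I_{t^2/(\nu + t^2)}\left(\frac{1}{2}, \frac{\nu}{2}\right). \tag{2.12.20}$$

Moreover, due to the symmetry of the t[ν] distribution, for t > 0 we have
$2P\{t[\nu] \leq t\} = 1 + P\{F[1, \nu] \leq t^2\}$, or

$$P\{t[\nu] \leq t\} = \frac{1}{2} \left(1 + I_{t^2/(\nu + t^2)}\left(\frac{1}{2}, \frac{\nu}{2}\right)\right). \tag{2.12.21}$$

In a similar manner we obtain a representation for $P\{|t[\nu; \xi]| \leq t\}$. Indeed,
$(N(0, 1) + \xi)^2 \sim \chi^2[1; \lambda]$ where $\lambda = \frac{1}{2} \xi^2$. Thus, according to (2.12.16)

$$P\{-t \leq t[\nu; \xi] \leq t\} = e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!}\, I_{t^2/(\nu + t^2)}\left(\frac{1}{2} + j, \frac{\nu}{2}\right). \tag{2.12.22}$$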
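2.13 THE DISTRIBUTION OF THE SAMPLE CORRELATION

Consider a sample of n i.i.d. vectors $(X_1, Y_1), \ldots, (X_n, Y_n)$ that have a common
bivariate normal distribution

$$N\left(\begin{pmatrix} \xi \\ \eta \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right).$$

In this section we develop the distributions of the following sample statistics.
(i) The sample correlation coefficient

$$r = SPD_{XY}/(SSD_X \cdot SSD_Y)^{1/2}; \tag{2.13.1}$$

(ii) The sample coefficient of regression

$$b = SPD_{XY}/SSD_X, \tag{2.13.2}$$

where

$$SSD_X = \sum_{i=1}^{n} (X_i - \bar{X})^2 = X'\left(I - \frac{1}{n} J\right) X,$$
$$SPD_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = Y'\left(I - \frac{1}{n} J\right) X, \tag{2.13.3}$$
$$SSD_Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = Y'\left(I - \frac{1}{n} J\right) Y.$$

As mentioned earlier, the joint density of (X, Y) can be written as

$$f(x, y) = \frac{1}{\sigma_1\sigma_2\sqrt{1 - \rho^2}}\, \phi\left(\frac{x - \xi}{\sigma_1}\right) \phi\left(\frac{y - \eta - \beta(x - \xi)}{\sigma_2\sqrt{1 - \rho^2}}\right), \tag{2.13.4}$$

where $\beta = \rho\sigma_2/\sigma_1$. Hence, if we make the transformation

$$U_i = X_i - \xi, \quad V_i = Y_i - \eta - \beta(X_i - \xi), \quad i = 1, \ldots, n, \tag{2.13.5}$$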
then $U_i$ and $V_i$ are independent random variables, $U_i \sim N(0, \sigma_1^2)$ and
$V_i \sim N(0, \sigma_2^2 (1 - \rho^2))$. We consider now the distributions of the variables

$$W_1 = SPD_{UV}/\sigma_2 [(1 - \rho^2) SSD_U]^{1/2},$$
$$W_2 = (SSD_V - SPD_{UV}^2/SSD_U)/\sigma_2^2 (1 - \rho^2), \tag{2.13.6}$$
$$W_3 = SSD_U/\sigma_1^2,$$

where $SSD_U$, $SPD_{UV}$ and $SSD_V$ are defined as in (2.13.3) in terms of $(U_i, V_i)$,
$i = 1, \ldots, n$. Let $U = (U_1, \ldots, U_n)'$ and $V = (V_1, \ldots, V_n)'$. We notice that the
conditional distribution of $SPD_{UV} = V'\left(I - \frac{1}{n} J\right) U$ given U is the normal
$N(0, \sigma_2^2 (1 - \rho^2) \cdot SSD_U)$. Hence, the conditional distribution of $W_1$ given U is N(0, 1).
This implies that $W_1$ is N(0, 1), independently of U. Furthermore, $W_1$ and $W_3$ are
independent, and $W_3 \sim \chi^2[n - 1]$. We consider now the variable $W_2$. It is easy to check

$$SSD_V - SPD_{UV}^2/SSD_U = V'\left(A - \frac{1}{SSD_U}\, AUU'A\right) V, \tag{2.13.7}$$

where $A = I - \frac{1}{n} J$. A is idempotent and so is $B = A - \frac{1}{SSD_U}\, AUU'A$. Furthermore,
the rank of B is n − 2. Hence, the conditional distribution of $SSD_V - SPD_{UV}^2/SSD_U$
given U is like that of $\sigma_2^2 (1 - \rho^2) \chi^2[n - 2]$. This implies that the distribution of
$W_2$ is like that of $\chi^2[n - 2]$. Obviously $W_2$ and $W_3$ are independent. We show now
that $W_1$ and $W_2$ are independent. Since $SPD_{UV} = V'AU$ and since
$BAU = \left(A - \frac{1}{SSD_U}\, AUU'A\right) AU = AU - \frac{1}{SSD_U}\, AU \cdot SSD_U = 0$ we obtain that, for any given U,
$SPD_{UV}$ and $SSD_V - SPD_{UV}^2/SSD_U$ are conditionally independent. Moreover, since
the conditional distributions of $SPD_{UV}/(SSD_U)^{1/2}$ and of $SSD_V - SPD_{UV}^2/SSD_U$ are
independent of U, $W_1$ and $W_2$ are independent. The variables $W_1$, $W_2$, and $W_3$ can be
written in terms of $SSD_X$, $SPD_{XY}$, and $SSD_Y$ in the following manner.

$$W_1 = (SPD_{XY} - \beta\, SSD_X)/[\sigma_2^2 (1 - \rho^2) SSD_X]^{1/2},$$
$$W_2 = (SSD_Y - SPD_{XY}^2/SSD_X)/\sigma_2^2 (1 - \rho^2), \tag{2.13.8}$$
$$W_3 = SSD_X/\sigma_1^2.$$

Or, equivalently,

$$W_1 = \frac{r\sqrt{SSD_Y}}{\sigma_2\sqrt{1 - \rho^2}} - \frac{\rho\sqrt{SSD_X}}{\sigma_1\sqrt{1 - \rho^2}},$$
$$W_2 = SSD_Y (1 - r^2)/\sigma_2^2 (1 - \rho^2), \tag{2.13.9}$$
$$W_3 = SSD_X/\sigma_1^2.$$
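From (2.13.9) one obtains that

$$\frac{r}{\sqrt{1 - r^2}} = \frac{W_1 + \frac{\rho}{\sqrt{1 - \rho^2}} \sqrt{W_3}}{\sqrt{W_2}}. \tag{2.13.10}$$

An immediate conclusion is that, when ρ = 0,

$$\frac{r}{\sqrt{1 - r^2}} \sqrt{n - 2} \sim \frac{N(0, 1)}{(\chi^2[n - 2]/(n - 2))^{1/2}} \sim t[n - 2]. \tag{2.13.11}$$

This result has important applications in testing the significance of the correlation
coefficient. Generally, one can prove that the p.d.f. of r is

$$f(r; \rho) = \frac{2^{n-3}}{\pi (n - 3)!}\, (1 - \rho^2)^{\frac{n-1}{2}} (1 - r^2)^{\frac{n-4}{2}} \sum_{j=0}^{\infty} \Gamma^2\left(\frac{n + j - 1}{2}\right) \frac{(2\rho r)^j}{j!}. \tag{2.13.12}$$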
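
A numerical sanity check of the series form of the p.d.f. of r is sketched below. It assumes the series is written as $\frac{2^{n-3}}{\pi (n-3)!} (1 - \rho^2)^{(n-1)/2} (1 - r^2)^{(n-4)/2} \sum_j \Gamma^2\left(\frac{n+j-1}{2}\right) \frac{(2\rho r)^j}{j!}$, truncates the sum at an assumed number of terms, and verifies that the density integrates to one over (−1, 1) and has a positive mean when ρ > 0. The values n = 6 and ρ = 0.5 are arbitrary.

```python
import math

def corr_pdf(r, n, rho, terms=400):
    """Series form of the p.d.f. of the sample correlation r (truncated sum)."""
    log_c = ((n - 3) * math.log(2.0) - math.log(math.pi) - math.lgamma(n - 2)
             + 0.5 * (n - 1) * math.log(1.0 - rho * rho))
    base = math.exp(log_c) * (1.0 - r * r) ** ((n - 4) / 2.0)
    z = 2.0 * rho * r
    total = 0.0
    for j in range(terms):
        # Gamma^2((n + j - 1)/2) / j!, computed on the log scale for stability.
        log_coef = 2.0 * math.lgamma((n + j - 1) / 2.0) - math.lgamma(j + 1)
        total += math.exp(log_coef) * z ** j
    return base * total

def simpson(g, a, b, panels=400):
    """Composite Simpson rule; panels must be even."""
    h = (b - a) / panels
    total = g(a) + g(b)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

n, rho = 6, 0.5
mass = simpson(lambda r: corr_pdf(r, n, rho), -1.0, 1.0)
mean_r = simpson(lambda r: r * corr_pdf(r, n, rho), -1.0, 1.0)
print(mass)     # close to 1
print(mean_r)   # positive when rho > 0
```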


2.14 EXPONENTIAL TYPE FAMILIES 


A family of distributions $\mathcal{F}$, having density functions f(x; θ) with respect to some
σ-finite measure μ, is called a k-parameter exponential type family if

$$f(x; \theta) = h(x) A(\theta) \exp\{\psi_1(\theta) U_1(x) + \cdots + \psi_k(\theta) U_k(x)\}, \tag{2.14.1}$$

$-\infty < x < \infty$, $\theta \in \Theta$. Here $\psi_i(\theta)$, $i = 1, \ldots, k$, are functions of the parameters and
$U_i(x)$, $i = 1, \ldots, k$, are functions of the observations.

In terms of the parameters $\psi = (\psi_1, \ldots, \psi_k)'$ and the statistics $U =
(U_1(x), \ldots, U_k(x))'$, the p.d.f. of a k-parameter exponential type distribution can
be written as

$$f(x; \psi) = h^*(U(x)) \exp\{-K(\psi)\} \exp\{\psi'U(x)\}, \tag{2.14.2}$$

where $K(\psi) = -\log A^*(\psi)$. Notice that $h^*(U(x)) > 0$ for all x on the support set
of $\mathcal{F}$, namely the closure of the smallest Borel set S, such that $P_\psi\{S\} = 1$ for all
ψ. If $h^*(U(x))$ does not depend on ψ, we say that the exponential type family $\mathcal{F}$ is
regular. Define the domain of convergence to be

$$\Omega^* = \left\{\psi : \int h^*(U(x)) \exp\{\psi'U(x)\}\, d\mu(x) < \infty\right\}. \tag{2.14.3}$$

The family $\mathcal{F}$ is called full if the parameter space Ω coincides with $\Omega^*$. Formula
(2.14.2) is called the canonical form of the p.d.f.; ψ are called the canonical (or
natural) parameters. The statistics $U_i(x)$ $(i = 1, \ldots, k)$ are called canonical statistics.
The family $\mathcal{F}$ is said to be of order k if $(1, \psi_1, \ldots, \psi_k)$ are linearly independent
functions of θ. Indeed if, for example, $\psi_k = \alpha_0 + \sum_{j=1}^{k-1} \alpha_j \psi_j$, for some $\alpha_0, \ldots, \alpha_{k-1}$,
which are not all zero, then by the reparametrization to

$$\psi'_j = (1 + \alpha_j) \psi_j, \quad j = 1, \ldots, k - 1,$$

we reduce the number of canonical parameters to k − 1. If $(1, \psi_1, \ldots, \psi_k)$ are linearly
independent, the exponential type family is called minimal.
The following is an important theorem.

Theorem 2.14.1. If Equation (2.14.2) is a minimal representation then

(i) $\Omega^*$ is a convex set, and K(ψ) is a strictly convex function on $\Omega^*$.

(ii) K(ψ) is a lower semicontinuous function on $\mathbb{R}^k$, and continuous in the interior
of $\Omega^*$.

For proof, see Brown (1986, p. 19).
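Let

$$\lambda(\psi) = \int h^*(U(x)) \exp\{\psi'U(x)\}\, d\mu(x). \tag{2.14.4}$$

Accordingly, $\lambda(\psi) = \exp\{K(\psi)\}$ or $K(\psi) = \log \lambda(\psi)$. λ(ψ) is an analytic function
on the interior of $\Omega^*$ (see Brown, 1986, p. 32). Thus, λ(ψ) can be differentiated
repeatedly under the integral sign and we have for nonnegative integers $l_i$, such that
$\sum_{i=1}^{k} l_i = l$,

$$\frac{\partial^l}{\prod_{i=1}^{k} \partial\psi_i^{l_i}}\, \lambda(\psi) = \int \prod_{i=1}^{k} (U_i(x))^{l_i}\, h^*(U(x)) \cdot \exp\{\psi'U(x)\}\, d\mu(x). \tag{2.14.5}$$

The m.g.f. of the canonical p.d.f. (2.14.2) is, for ψ in $\Omega^*$,

$$M(t; \psi) = \int h^*(U(x)) \exp\{-K(\psi) + (\psi + t)'U(x)\}\, d\mu(x) = \exp\{-K(\psi) + K(\psi + t)\} \tag{2.14.6}$$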
for t sufficiently close to 0. The logarithm of M(t; ψ), the cumulants generating
function, is given here by

$$K^*(t; \psi) = -K(\psi) + K(\psi + t). \tag{2.14.7}$$

Accordingly,

$$E_\psi\{U\} = \nabla K^*(t; \psi)\big|_{t=0} = \nabla K(\psi), \tag{2.14.8}$$

where ∇ denotes the gradient vector, i.e.,

$$\nabla K(\psi) = \left(\frac{\partial}{\partial\psi_1} K(\psi), \ldots, \frac{\partial}{\partial\psi_k} K(\psi)\right)'.$$

Similarly, the covariance matrix of U is

$$V_\psi(U) = \left(\frac{\partial^2}{\partial\psi_i \partial\psi_j} K(\psi);\; i, j = 1, \ldots, k\right). \tag{2.14.9}$$

Higher order cumulants can be obtained by additional differentiation of K(w). We 
conclude this section with several comments. 


1. The marginal distributions of canonical statistics are canonical exponential type 
distributions. 

2. The conditional distribution of a subvector of canonical exponential type statis- 
tics, given the other canonical statistics, is also a canonical exponential type 
distribution. 

3. The dimension of $\Omega^*$ in a minimal canonical exponential family of order $k$
might be smaller than $k$. In this case we call $\mathcal{F}$ a curved exponential family
(Efron, 1975, 1978).
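
The relations (2.14.8) and (2.14.9) can be checked numerically on a simple case. The sketch below assumes, purely for illustration, the Poisson family written in canonical form with $U(x) = x$, $\psi = \log\lambda$ and $K(\psi) = e^{\psi}$; the helper names `K_poisson` and `derivative` are hypothetical and not taken from the text. Finite differences of $K$ should then return a mean and a variance both close to $\lambda$.

```python
import math


def K_poisson(psi):
    """Cumulant function K(psi) = exp(psi) of the Poisson family in canonical form
    (assumed illustration: U(x) = x, psi = log(lambda))."""
    return math.exp(psi)


def derivative(func, x, order=1, h=1e-3):
    """Central finite-difference approximation of the first or second derivative."""
    if order == 1:
        return (func(x + h) - func(x - h)) / (2.0 * h)
    if order == 2:
        return (func(x + h) - 2.0 * func(x) + func(x - h)) / (h * h)
    raise ValueError("only order 1 or 2 is supported")


psi = math.log(3.0)                             # canonical parameter for Poisson mean 3
mean_U = derivative(K_poisson, psi, order=1)    # approximates E{U} = K'(psi)
var_U = derivative(K_poisson, psi, order=2)     # approximates V{U} = K''(psi)
print(mean_U, var_U)
```

Both printed values should agree with $\lambda = 3$ up to the finite-difference error; higher order cumulants would need higher order differences of the same function.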


2.15 APPROXIMATING THE DISTRIBUTION OF THE SAMPLE MEAN: EDGEWORTH AND SADDLEPOINT APPROXIMATIONS


Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a distribution, with all required
moments existing.

2.15.1 Edgeworth Expansion


The Edgeworth expansion of the distribution of $W_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$, which is
developed below, may yield a more satisfactory approximation than that of the normal.
This expansion is based on the following development.

The p.d.f. of the standard normal distribution, $\phi(x)$, has continuous derivatives of
all orders everywhere. By repeated differentiation we obtain

$$\phi'(x) = -x\phi(x), \qquad \phi''(x) = (x^2 - 1)\phi(x), \tag{2.15.1}$$

and generally, for $j \geq 1$,

$$\phi^{(j)}(x) = (-1)^j H_j(x)\phi(x), \tag{2.15.2}$$

where $H_j(x)$ is a polynomial of order $j$, called the Chebychev-Hermite polynomial.
These polynomials can be obtained recursively by the formula, for $j \geq 2$,

$$H_j(x) = x H_{j-1}(x) - (j - 1) H_{j-2}(x), \tag{2.15.3}$$

where $H_0(x) = 1$ and $H_1(x) = x$.
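
The recursion (2.15.3) is easy to run mechanically. The following minimal sketch stores each polynomial as a list of integer coefficients, lowest degree first; the function name `hermite_coefficients` is an illustrative choice, not notation from the text.

```python
def hermite_coefficients(order):
    """Coefficients of H_order(x), lowest degree first, built from
    H_j(x) = x*H_{j-1}(x) - (j-1)*H_{j-2}(x) with H_0 = 1 and H_1 = x."""
    prev, curr = [1], [0, 1]
    if order == 0:
        return prev
    for j in range(2, order + 1):
        shifted = [0] + curr                                  # x * H_{j-1}(x)
        padded = prev + [0] * (len(shifted) - len(prev))      # H_{j-2}(x), same length
        nxt = [a - (j - 1) * b for a, b in zip(shifted, padded)]
        prev, curr = curr, nxt
    return curr


for j in range(5):
    print(j, hermite_coefficients(j))
```

For $j = 3$ and $j = 4$ the lists correspond to $x^3 - 3x$ and $x^4 - 6x^2 + 3$, the polynomials that appear in the approximation (2.15.9).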

From this recursive relation one can prove by induction that an even order polynomial
$H_{2m}(x)$, $m \geq 1$, contains only terms with even powers of $x$, and an odd order
polynomial, $H_{2n+1}(x)$, $n \geq 0$, contains only terms with odd powers of $x$. One can
also show that

$$\int_{-\infty}^{\infty} H_j(x)\phi(x)\,dx = 0, \quad \text{for all } j \geq 1. \tag{2.15.4}$$

Furthermore, one can prove the orthogonality property

$$\int_{-\infty}^{\infty} H_j(x) H_k(x) \phi(x)\,dx = \begin{cases} 0, & \text{if } j \neq k, \\ j!, & \text{if } j = k. \end{cases} \tag{2.15.5}$$

Thus, the system $\{H_j(x), j = 0, 1, \ldots\}$ of Chebychev-Hermite polynomials constitutes
an orthogonal base for representing every continuous, integrable function $f(x)$
as

$$f(x) = \sum_{j=0}^{\infty} c_j H_j(x)\phi(x), \tag{2.15.6}$$

where, according to (2.15.5),

$$c_j = \frac{1}{j!} \int_{-\infty}^{\infty} H_j(x) f(x)\,dx, \quad j \geq 0. \tag{2.15.7}$$

In particular, if $f(x)$ is a p.d.f. of an absolutely continuous distribution, having all
moments, then, for all $-\infty < x < \infty$,

$$f(x) = \phi(x) + \sum_{j=1}^{\infty} c_j H_j(x)\phi(x). \tag{2.15.8}$$

Moreover,

$$c_1 = \int x f(x)\,dx = \mu_1,$$

$$c_2 = \frac{1}{2} \int (x^2 - 1) f(x)\,dx = \frac{1}{2}(\mu_2 - 1),$$

$$c_3 = \frac{1}{6}(\mu_3 - 3\mu_1),$$

etc. If $X$ is a standardized random variable, i.e., $\mu_1 = 0$ and $\mu_2 = \sigma^2 = 1$, then its
p.d.f. $f(x)$ can be approximated by the formula

$$f(x) \cong \phi(x) + \frac{1}{6}\mu_3 (x^3 - 3x)\phi(x) + \frac{1}{24}(\mu_4 - 3)(x^4 - 6x^2 + 3)\phi(x), \tag{2.15.9}$$


which involves the first four terms of the expansion (2.15.8). For the standardized
sample mean $W_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$,

$$\mu_{3,n} = E\{W_n^3\} = \frac{\beta_1}{\sqrt{n}} \tag{2.15.10}$$

and

$$\mu_{4,n} = E\{W_n^4\} = 3 + \frac{\beta_2 - 3}{n}, \tag{2.15.11}$$

where $\beta_1$ and $\beta_2$ are the coefficients of skewness and kurtosis.
The same type of approximation with additional terms is known as the Edgeworth
expansion. The Edgeworth approximation to the c.d.f. of $W_n$ is

$$P\{W_n \leq x\} \cong \Phi(x) - \frac{\beta_1}{6\sqrt{n}}(x^2 - 1)\phi(x) - \frac{x}{n}\left[\frac{\beta_2 - 3}{24}(x^2 - 3) + \frac{\beta_1^2}{72}(x^4 - 10x^2 + 15)\right]\phi(x). \tag{2.15.12}$$

The remainder term in this approximation is of a smaller order of magnitude than
$\frac{1}{n}$, i.e., $o\left(\frac{1}{n}\right)$. One can obviously expand the distribution with additional terms to
obtain a higher order of accuracy. Notice that the standard CLT can be proven by
taking limits, as $n \to \infty$, of the two sides of (2.15.12).
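
As a rough numerical companion to (2.15.12), the sketch below evaluates the Edgeworth approximation for given skewness $\beta_1$ and kurtosis $\beta_2$. The function names are illustrative, and the parent distribution in the usage lines (an exponential, for which $\beta_1 = 2$ and $\beta_2 = 9$) is an assumption chosen only as an example.

```python
import math


def std_normal_pdf(x):
    """Standard normal p.d.f. phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    """Standard normal c.d.f. Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def edgeworth_cdf(x, n, beta1, beta2):
    """Edgeworth approximation (2.15.12) to P{W_n <= x}; beta1 is the skewness
    and beta2 the kurtosis of the parent distribution."""
    phi = std_normal_pdf(x)
    skew_term = beta1 / (6.0 * math.sqrt(n)) * (x * x - 1.0) * phi
    bracket = ((beta2 - 3.0) / 24.0) * (x * x - 3.0) \
        + (beta1 ** 2 / 72.0) * (x ** 4 - 10.0 * x * x + 15.0)
    return std_normal_cdf(x) - skew_term - (x / n) * bracket * phi


# Exponential parent (assumed example): beta1 = 2, beta2 = 9.
for n in (10, 20, 1000):
    print(n, edgeworth_cdf(0.0, n, 2.0, 9.0))
```

As $n$ grows the correction terms vanish and the value returns to $\Phi(x)$, which is the CLT limit mentioned above.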




We conclude this section with the remark that Equation (2.15.9) could serve to 
approximate the p.d.f. of any standardized random variable, having a continuous, 
integrable p.d.f., provided the moments exist. 


2.15.2 Saddlepoint Approximation 


As before, let $X_1, \ldots, X_n$ be i.i.d. random variables having a common density $f(x)$.
We wish to approximate the p.d.f. of $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Let $M(t)$ be the m.g.f. of $X$,
assumed to exist for all $t$ in $(-\infty, t_0)$, for some $0 < t_0 < \infty$. Let $K(t) = \log M(t)$ be
the corresponding cumulants generating function.
We construct a family of distributions $\mathcal{F} = \{f(x; \psi) : -\infty < \psi < t_0\}$ such that

$$f(x; \psi) = f(x)\exp\{\psi x - K(\psi)\}. \tag{2.15.13}$$

The family $\mathcal{F}$ is called an exponential conjugate to $f(x)$. Notice that $f(x; 0) = f(x)$,
and that $\int_{-\infty}^{\infty} f(x; \psi)\,d\mu(x) = 1$ for all $\psi < t_0$.

Using the inversion formula for Laplace transforms, one gets the relationship

$$f_{\bar{X}}(x; \psi) = f_{\bar{X}}(x) \cdot \exp\{n(\psi x - K(\psi))\}, \tag{2.15.14}$$


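where $f_{\bar{X}}(x; \psi)$ denotes the p.d.f. of the sample mean of $n$ i.i.d. random variables
from $f(x; \psi)$. The p.d.f. $f_{\bar{X}}(x; \psi)$ is now approximated by the expansion (2.15.9) with
additional terms, and its modification for the standardized mean $W_n$. Accordingly,

$$f_{\bar{X}_n}(x; \psi) \cong \frac{\sqrt{n}}{\sigma(\psi)}\,\phi(z)\left[1 + \frac{\rho_3(\psi)}{6\sqrt{n}} H_3(z) + \frac{\rho_4(\psi)}{24n} H_4(z) + \frac{\rho_3^2(\psi)}{72n} H_6(z)\right], \tag{2.15.15}$$

where $\phi(z)$ is the p.d.f. of $N(0, 1)$, $z = \sqrt{n}\,\frac{x - \mu(\psi)}{\sigma(\psi)}$, $\rho_3(\psi) = \frac{K^{(3)}(\psi)}{(K^{(2)}(\psi))^{3/2}}$, and
$\rho_4(\psi) = K^{(4)}(\psi)/(K^{(2)}(\psi))^2$. Furthermore, $\mu(\psi) = K'(\psi)$ and $\sigma^2(\psi) = K^{(2)}(\psi)$.
The objective is to approximate $f_{\bar{X}}(x)$. According to (2.15.14) and (2.15.15), we
approximate $f_{\bar{X}}(x)$ by

$$f_{\bar{X}}(x) = f_{\bar{X}}(x; \psi)\exp\{n[K(\psi) - \psi x]\}$$

$$\cong \frac{\sqrt{n}}{\sigma(\psi)}\,\phi(z)\left[1 + \frac{\rho_3(\psi)}{6\sqrt{n}} H_3(z) + \frac{\rho_4(\psi)}{24n} H_4(z) + \frac{\rho_3^2(\psi)}{72n} H_6(z)\right]\exp\{n[K(\psi) - \psi x]\}. \tag{2.15.16}$$

The approximation is called a saddlepoint approximation if we substitute in
(2.15.16) $\psi = \hat{\psi}$, where $\hat{\psi}$ is a point in $(-\infty, t_0)$ that maximizes $f(x; \psi)$. Thus, $\hat{\psi}$
is the root of the equation

$$K'(\psi) = x.$$

As we have seen in Section 2.14, $K(\psi)$ is strictly convex in the interior of $(-\infty, t_0)$.
Thus, $K'(\psi)$ is strictly increasing in $(-\infty, t_0)$. Thus, if $\hat{\psi}$ exists then it is unique.
Moreover, the value of $z$ at $\psi = \hat{\psi}$ is $z = 0$. It follows that the saddlepoint approximation
is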


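$$f_{\bar{X}}(x) = \frac{c\sqrt{n}}{(2\pi K^{(2)}(\hat{\psi}))^{1/2}} \exp\{n[K(\hat{\psi}) - \hat{\psi} x]\} \cdot \left[1 + \frac{1}{n}\left(\frac{\rho_4(\hat{\psi})}{8} - \frac{5}{24}\rho_3^2(\hat{\psi})\right) + O\left(\frac{1}{n^2}\right)\right]. \tag{2.15.17}$$

The coefficient $c$ is introduced on the right-hand side of (2.15.17) for normalization.
A lower order approximation is given by the formula

$$f_{\bar{X}}(x) \cong \frac{c\sqrt{n}}{(2\pi K^{(2)}(\hat{\psi}))^{1/2}} \exp\{n[K(\hat{\psi}) - \hat{\psi} x]\}. \tag{2.15.18}$$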
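
To see how (2.15.18) behaves in a case where the exact answer is available, the sketch below assumes a gamma parent $G(\lambda, \nu)$ (rate $\lambda$, shape $\nu$), for which $K(\psi) = -\nu\log(1 - \psi/\lambda)$ and the mean of $n$ observations has a known gamma density. The function names are illustrative, and the normalizing constant is simply set to $c = 1$.

```python
import math


def exact_mean_density(x, n, lam, nu):
    """Exact p.d.f. of the mean of n i.i.d. G(lam, nu) variables (rate lam, shape nu)."""
    a = n * nu
    log_f = a * math.log(n * lam) - math.lgamma(a) + (a - 1.0) * math.log(x) - n * lam * x
    return math.exp(log_f)


def saddlepoint_mean_density(x, n, lam, nu):
    """Lower order saddlepoint approximation (2.15.18) with c = 1 (assumed)."""
    psi_hat = lam - nu / x                        # root of K'(psi) = x
    k0 = -nu * math.log(1.0 - psi_hat / lam)      # K(psi_hat)
    k2 = nu / (lam - psi_hat) ** 2                # K''(psi_hat) = x^2 / nu
    return math.sqrt(n / (2.0 * math.pi * k2)) * math.exp(n * (k0 - psi_hat * x))


for x in (0.5, 1.0, 2.0):
    ratio = saddlepoint_mean_density(x, 10, 1.0, 1.0) / exact_mean_density(x, 10, 1.0, 1.0)
    print(x, ratio)
```

In this particular family the ratio of the approximation to the exact density does not depend on $x$, so the normalizing constant $c$ would make the approximation exact; that is a feature of the gamma case and should not be expected in general.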


The saddlepoint approximation to the tail of the c.d.f., i.e., $P\{\bar{X}_n \geq \xi\}$, is known to
yield very accurate results. There is a famous Lugannani-Rice (1980) approximation
to this tail probability. For additional reading, see Barndorff-Nielsen and Cox (1979),
Jensen (1995), Field and Ronchetti (1990), Reid (1988), and Skovgaard (1990).

Example 2.1. In this example we provide a few important results on the distributions
of sums of independent random variables.


A. Binomial

If $X_1$ and $X_2$ are independent, $X_1 \sim B(N_1, \theta)$, $X_2 \sim B(N_2, \theta)$, then $X_1 + X_2 \sim
B(N_1 + N_2, \theta)$. It is essential that the binomial distributions of $X_1$ and $X_2$ have
the same value of $\theta$. The proof is obtained by multiplying the corresponding m.g.f.s.


B. Poisson

If $X_1 \sim P(\lambda_1)$ and $X_2 \sim P(\lambda_2)$ then, under independence, $X_1 + X_2 \sim
P(\lambda_1 + \lambda_2)$.


C. Negative-Binomial

If $X_1 \sim NB(\psi, \nu_1)$ and $X_2 \sim NB(\psi, \nu_2)$ then, under independence, $X_1 + X_2 \sim
NB(\psi, \nu_1 + \nu_2)$. It is essential that the two distributions depend on the same $\psi$.


D. Gamma

If $X_1 \sim G(\lambda, \nu_1)$ and $X_2 \sim G(\lambda, \nu_2)$ then, under independence, $X_1 + X_2 \sim
G(\lambda, \nu_1 + \nu_2)$. It is essential that the two values of the parameter $\lambda$ be the
same. In particular,

$$\chi_1^2[\nu_1] + \chi_2^2[\nu_2] \sim \chi^2[\nu_1 + \nu_2]$$

for all $\nu_1, \nu_2 = 1, 2, \ldots$; where $\chi_i^2[\nu_i]$, $i = 1, 2$, denote two independent $\chi^2$-random
variables with $\nu_1$ and $\nu_2$ degrees of freedom, respectively. This result has important
applications in the theory of normal regression analysis.


E. Normal

If $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ and if $X_1$ and $X_2$ are independent, then
$X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. A generalization of this result to the case of
possible dependence is given later.
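
Part A can be checked directly by convolving the two probability functions. The sketch below is a small illustration with hypothetical function names; it also shows that the result fails to be binomial when the two success probabilities differ.

```python
from math import comb


def binomial_pmf(j, n, theta):
    """b(j; n, theta), the binomial probability function."""
    return comb(n, j) * theta ** j * (1.0 - theta) ** (n - j)


def convolve_binomials(n1, n2, theta1, theta2):
    """Probability function of X1 + X2 for independent X1 ~ B(n1, theta1), X2 ~ B(n2, theta2)."""
    total = [0.0] * (n1 + n2 + 1)
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            total[i + j] += binomial_pmf(i, n1, theta1) * binomial_pmf(j, n2, theta2)
    return total


same = convolve_binomials(3, 4, 0.3, 0.3)
target = [binomial_pmf(j, 7, 0.3) for j in range(8)]
print(max(abs(a - b) for a, b in zip(same, target)))    # rounding error only

different = convolve_binomials(3, 4, 0.2, 0.6)
print(max(abs(a - binomial_pmf(j, 7, 0.2)) for j, a in enumerate(different)))
```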


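Example 2.2. Using the theory of transformations, the following important result is
derived. Let $X_1$ and $X_2$ be independent,

$$X_1 \sim G(\lambda, \nu_1) \quad \text{and} \quad X_2 \sim G(\lambda, \nu_2),$$

then the ratio $R = X_1/(X_1 + X_2)$ has a beta distribution, $\beta(\nu_1, \nu_2)$, independent of
$\lambda$. Furthermore, $R$ and $T = X_1 + X_2$ are independent. Indeed, the joint p.d.f. of $X_1$
and $X_2$ is

$$f(x_1, x_2) = \frac{\lambda^{\nu_1 + \nu_2}}{\Gamma(\nu_1)\Gamma(\nu_2)}\, x_1^{\nu_1 - 1} x_2^{\nu_2 - 1} \exp\{-\lambda(x_1 + x_2)\}, \quad 0 \leq x_1, x_2 \leq \infty.$$

Consider the transformation

$$X_1 = X_1, \qquad T = X_1 + X_2.$$

The Jacobian of this transformation is $J(x_1, t) = 1$. The joint p.d.f. of $X_1$ and $T$ is
then

$$g(x_1, t) = \frac{\lambda^{\nu_1 + \nu_2}}{\Gamma(\nu_1)\Gamma(\nu_2)}\, x_1^{\nu_1 - 1}(t - x_1)^{\nu_2 - 1} \exp\{-\lambda t\}, \quad 0 \leq x_1 \leq t \leq \infty.$$

We have seen in the previous example that $T = X_1 + X_2 \sim G(\lambda, \nu_1 + \nu_2)$. Thus, the
marginal p.d.f. of $T$ is

$$h(t) = \frac{\lambda^{\nu_1 + \nu_2}}{\Gamma(\nu_1 + \nu_2)}\, t^{\nu_1 + \nu_2 - 1} e^{-\lambda t}, \quad 0 \leq t < \infty.$$

Making now the transformation

$$t = t, \qquad r = x_1/t,$$

we see that the Jacobian is $J(r, t) = t$. Hence, from (2.4.8) and (2.4.9) the joint p.d.f.
of $r$ and $t$ is, for $0 \leq r \leq 1$ and $0 \leq t < \infty$,

$$g(r, t) = \frac{1}{B(\nu_1, \nu_2)}\, r^{\nu_1 - 1}(1 - r)^{\nu_2 - 1} \cdot h(t).$$

This proves that $R \sim \beta(\nu_1, \nu_2)$ and that $R$ and $T$ are independent.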


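Example 2.3. Let $(X, \lambda)$ be random variables, such that the conditional distribution
of $X$ given $\lambda$ is Poisson with p.d.f.

$$p(x; \lambda) = e^{-\lambda}\frac{\lambda^x}{x!}, \quad x = 0, 1, \ldots$$

and $\lambda \sim G(\Lambda, \nu)$. Hence, the marginal p.d.f. of $X$ is

$$p(x) = \frac{\Lambda^\nu}{\Gamma(\nu)\,x!} \int_0^\infty \lambda^{x + \nu - 1} e^{-\lambda(1 + \Lambda)}\,d\lambda = \frac{\Gamma(x + \nu)}{\Gamma(\nu)\,x!} \cdot \frac{\Lambda^\nu}{(1 + \Lambda)^{\nu + x}}, \quad x = 0, 1, \ldots.$$

Let $\psi = \frac{1}{1 + \Lambda}$, $1 - \psi = \frac{\Lambda}{1 + \Lambda}$. Then $p(x) = \frac{\Gamma(\nu + x)}{\Gamma(x + 1)\Gamma(\nu)}(1 - \psi)^\nu \psi^x$. Thus,
$X \sim NB(\psi, \nu)$, and we get

$$NB(k; \psi, \nu) = \frac{\Lambda^\nu}{\Gamma(\nu)} \int_0^\infty \left(\sum_{l=0}^{k} e^{-\lambda}\frac{\lambda^l}{l!}\right) \lambda^{\nu - 1} e^{-\lambda\Lambda}\,d\lambda.$$

But,

$$\sum_{l=0}^{k} e^{-\lambda}\frac{\lambda^l}{l!} = 1 - P\{G(1, k + 1) \leq \lambda\}.$$

Hence,

$$NB(k; \psi, \nu) = 1 - P\left\{G(1, k + 1) \leq \frac{1}{\Lambda} G(1, \nu)\right\},$$

where $G(1, k + 1)$ and $G(1, \nu)$ are independent.

Let $R = \frac{G(1, \nu)}{G(1, k + 1)}$. According to Example 2.2,

$$U = \frac{G(1, \nu)}{G(1, k + 1) + G(1, \nu)} \sim \beta(\nu, k + 1).$$

But $U \sim \frac{R}{1 + R}$; hence,

$$NB(k; \psi, \nu) = 1 - P\{R \geq \Lambda\} = 1 - P\left\{U \geq \frac{\Lambda}{1 + \Lambda}\right\} = P\{U \leq 1 - \psi\} = I_{1 - \psi}(\nu, k + 1).$$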


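Example 2.4. Let $X_1, \ldots, X_n$ be i.i.d. random variables. Consider the linear and
the quadratic functions

sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$,
sample variance: $S^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (X_i - \bar{X})^2$.

We compute first the variance of $S^2$. Notice first that $S^2$ does not change its value if
we substitute $X_i' = X_i - \mu_1$ for $X_i$ ($i = 1, \ldots, n$). Thus, we can assume that $\mu_1 = 0$
and all moments are central moments.

Write $S^2 = \frac{1}{n - 1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$. Accordingly,

$$V\{S^2\} = \frac{1}{(n - 1)^2}\left[V\left\{\sum_{i=1}^{n} X_i^2\right\} + n^2 V\{\bar{X}^2\} - 2n\,\mathrm{cov}\left(\sum_{i=1}^{n} X_i^2, \bar{X}^2\right)\right].$$

Now, since $X_1, \ldots, X_n$ are i.i.d.,

$$V\left\{\sum_{i=1}^{n} X_i^2\right\} = nV\{X_1^2\} = n(\mu_4 - \mu_2^2).$$

Also,

$$V\{\bar{X}^2\} = E\left\{\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^4\right\} - \left(E\left\{\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^2\right\}\right)^2 = \frac{1}{n^4} E\left\{\left(\sum_{i=1}^{n} X_i\right)^4\right\} - \frac{1}{n^4}\left(E\left\{\left(\sum_{i=1}^{n} X_i\right)^2\right\}\right)^2.$$

According to Table 2.3,

$$(1)^4 = [4] + 4[31] + 3[2^2] + 6[21^2] + [1^4].$$

Thus,

$$\left(\sum_{i=1}^{n} X_i\right)^4 = \sum_{i=1}^{n} X_i^4 + 4\sum\sum_{i \neq j} X_i^3 X_j + 3\sum\sum_{i \neq j} X_i^2 X_j^2 + 6\sum\sum\sum_{i \neq j \neq k} X_i^2 X_j X_k + \sum\sum\sum\sum_{i \neq j \neq k \neq l} X_i X_j X_k X_l.$$

Therefore, since $\mu_1 = 0$, the independence implies that

$$E\left\{\left(\sum_{i=1}^{n} X_i\right)^4\right\} = n\mu_4 + 3n(n - 1)\mu_2^2.$$

Also,

$$E\left\{\left(\sum_{i=1}^{n} X_i\right)^2\right\} = n\mu_2.$$

Thus,

$$n^2 V\{\bar{X}^2\} = \frac{1}{n^2}\left[n\mu_4 + 3n(n - 1)\mu_2^2 - n^2\mu_2^2\right] = \frac{1}{n}\left[\mu_4 + (2n - 3)\mu_2^2\right].$$

At this stage we have to compute

$$\mathrm{cov}\left(\sum_{i=1}^{n} X_i^2, \bar{X}^2\right) = \frac{1}{n^2} E\left\{\left(\sum_{i=1}^{n} X_i^2\right)\left(\sum_{i=1}^{n} X_i\right)^2\right\} - E\left\{\sum_{i=1}^{n} X_i^2\right\} E\{\bar{X}^2\}.$$

From Table 2.3, $(2)(1)^2 = [4] + 2[31] + [2^2] + [21^2]$. Hence,

$$E\left\{\left(\sum_{i=1}^{n} X_i^2\right)\left(\sum_{i=1}^{n} X_i\right)^2\right\} = n\mu_4 + n(n - 1)\mu_2^2$$

and

$$E\left\{\sum_{i=1}^{n} X_i^2\right\} E\{\bar{X}^2\} = \mu_2^2.$$

Therefore,

$$-2n\,\mathrm{cov}\left(\sum_{i=1}^{n} X_i^2, \bar{X}^2\right) = -2(\mu_4 - \mu_2^2).$$

Finally, substituting these terms we obtain

$$V\{S^2\} = \frac{\mu_4 - \mu_2^2}{n - 1} - \frac{\mu_4 - 3\mu_2^2}{n(n - 1)}.$$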
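
The final formula can be verified exactly on a small discrete distribution by enumerating all samples. The sketch below assumes, for illustration only, a distribution that is uniform on a finite support; the function names are hypothetical, and rational arithmetic is used so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product


def exact_variance_of_s2(support, n):
    """Exact V{S^2} for n i.i.d. draws, uniform on `support`, by full enumeration."""
    values = []
    for sample in product(support, repeat=n):
        mean = Fraction(sum(sample), n)
        values.append(sum((x - mean) ** 2 for x in sample) / (n - 1))
    m1 = sum(values) / len(values)
    return sum((v - m1) ** 2 for v in values) / len(values)


def formula_variance_of_s2(support, n):
    """(mu4 - mu2^2)/(n - 1) - (mu4 - 3*mu2^2)/(n(n - 1)), using central moments."""
    mu = Fraction(sum(support), len(support))
    mu2 = sum((x - mu) ** 2 for x in support) / len(support)
    mu4 = sum((x - mu) ** 4 for x in support) / len(support)
    return (mu4 - mu2 ** 2) / (n - 1) - (mu4 - 3 * mu2 ** 2) / (n * (n - 1))


support = (0, 1, 3)
for n in (2, 3, 4):
    print(n, exact_variance_of_s2(support, n), formula_variance_of_s2(support, n))
```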


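Example 2.5. We develop now the formula for the covariance of $\bar{X}$ and $S^2$.

$$\mathrm{cov}(\bar{X}, S^2) = \mathrm{cov}\left(\frac{1}{n}\sum_{j=1}^{n} X_j,\ \frac{1}{n - 1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)\right) = \frac{1}{n(n - 1)}\left[\mathrm{cov}\left(\sum_{i=1}^{n} X_i^2, \sum_{j=1}^{n} X_j\right) - n^2\,\mathrm{cov}(\bar{X}^2, \bar{X})\right].$$

First,

$$\mathrm{cov}\left(\sum_{i=1}^{n} X_i^2, \sum_{j=1}^{n} X_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} \mathrm{cov}(X_i^2, X_j) = \sum_{i=1}^{n} \mathrm{cov}(X_i^2, X_i) = n\mu_3,$$

since the independence of $X_i$ and $X_j$ for all $i \neq j$ implies that $\mathrm{cov}(X_i^2, X_j) = 0$.
Similarly,

$$\mathrm{cov}(\bar{X}^2, \bar{X}) = E\{\bar{X}^3\} = \frac{1}{n^2}\mu_3.$$

Thus, we obtain

$$\mathrm{cov}(\bar{X}, S^2) = \frac{1}{n}\mu_3.$$

Finally, if the distribution function $F(x)$ is symmetric about zero, $\mu_3 = 0$, and
$\mathrm{cov}(\bar{X}, S^2) = 0$.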


Example 2.6. The number of items, $N$, demanded in a given store during one week is
a random variable having a Negative-Binomial distribution $NB(\psi, \nu)$; $0 < \psi < 1$ and
$0 < \nu < \infty$. These items belong to $k$ different classes. Let $X = (X_1, \ldots, X_k)'$ denote
a vector consisting of the number of items of each class demanded during the week.
These are random variables such that $\sum_{i=1}^{k} X_i = N$ and the conditional distribution
of $(X_1, \ldots, X_k)$ given $N$ is the multinomial $M(N, \theta)$, where $\theta = (\theta_1, \ldots, \theta_k)$ is the
vector of probabilities; $0 < \theta_i < 1$, $\sum_{j=1}^{k} \theta_j = 1$. If we observe the $X$ vectors over
many weeks and construct the proportional frequencies of the $X$ values in the various
classes, we obtain an empirical distribution of these vectors. Under the assumption
that the model and its parameters remain the same over the weeks we can fit to
that empirical distribution the theoretical marginal distribution of $X$. This marginal
distribution is obtained in the following manner.

The m.g.f. of the conditional multinomial distribution of $X^* = (X_1, \ldots, X_{k-1})'$
given $N$ is

$$M_{X^* \mid N}(t_1, \ldots, t_{k-1}) = \left[\sum_{i=1}^{k-1} \theta_i e^{t_i} + \left(1 - \sum_{i=1}^{k-1} \theta_i\right)\right]^N.$$

Hence, the m.g.f. of the marginal distribution of $X^*$ is

$$M_{X^*}(t_1, \ldots, t_{k-1}) = (1 - \psi)^\nu \sum_{n=0}^{\infty} \frac{\Gamma(\nu + n)}{\Gamma(\nu)\Gamma(n + 1)}\, \psi^n \left[\sum_{i=1}^{k-1} \theta_i e^{t_i} + \left(1 - \sum_{i=1}^{k-1} \theta_i\right)\right]^n,$$

or

$$M_{X^*}(t_1, \ldots, t_{k-1}) = \left(\frac{1 - \sum_{i=1}^{k-1} w_i}{1 - \sum_{i=1}^{k-1} w_i e^{t_i}}\right)^\nu,$$

where

$$w_i = \frac{\psi\theta_i}{1 - \psi\left(1 - \sum_{i=1}^{k-1} \theta_i\right)}, \quad i = 1, \ldots, k - 1.$$

This proves that $X^*$ has the multivariate Negative-Binomial distribution.

Example 2.7. Consider a random variable $X$ having a normal distribution, $N(\xi, \sigma^2)$.
Let $\Phi(u)$ be the standard normal c.d.f. The transformed variable $Y = \Phi(X)$ is of
interest in various problems of statistical inference in the fields of reliability, quality
control, biostatistics, and others. In this example we study the first two moments
of $Y$.

In the special case of $\xi = 0$ and $\sigma^2 = 1$, since $\Phi(u)$ is the c.d.f. of $X$, the above
transformation yields a rectangular random variable, i.e., $Y \sim R(0, 1)$. In this case,
obviously $E\{Y\} = 1/2$ and $V\{Y\} = 1/12$. In the general case, we have according to
the law of the iterated expectation

$$E\{Y\} = E\{\Phi(X)\} = E\{P\{U \leq X \mid X\}\} = P\{U \leq X\},$$

where $U \sim N(0, 1)$, $U$ and $X$ are independent. Moreover, according to (2.7.7),
$U - X \sim N(-\xi, 1 + \sigma^2)$. Therefore,

$$E\{Y\} = \Phi\left(\xi/\sqrt{1 + \sigma^2}\right).$$
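
The closed form for $E\{Y\}$ can be compared with a direct numerical integration of $\Phi(x)$ against the $N(\xi, \sigma^2)$ density. The sketch below uses a plain trapezoidal rule; the function names, the step count, and the integration width are illustrative assumptions.

```python
import math


def std_normal_cdf(u):
    """Standard normal c.d.f. Phi(u)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def expected_phi_numeric(xi, sigma, steps=20000, width=10.0):
    """Trapezoidal approximation of E{Phi(X)} for X ~ N(xi, sigma^2)."""
    lo, hi = xi - width * sigma, xi + width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        density = math.exp(-0.5 * ((x - xi) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * std_normal_cdf(x) * density
    return total * h


def expected_phi_closed_form(xi, sigma):
    """E{Phi(X)} = Phi(xi / sqrt(1 + sigma^2))."""
    return std_normal_cdf(xi / math.sqrt(1.0 + sigma ** 2))


print(expected_phi_numeric(0.7, 1.5), expected_phi_closed_form(0.7, 1.5))
```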


In order to determine the variance of $Y$ we observe first that, if $U_1$, $U_2$ are independent
random variables identically distributed like $N(0, 1)$, then $P\{U_1 \leq x, U_2 \leq x\} =
\Phi^2(x)$ for all $-\infty < x < \infty$. Thus,

$$E\{Y^2\} = E\{\Phi^2(X)\} = P\{U_1 - X \leq 0, U_2 - X \leq 0\},$$

where $U_1$, $U_2$ and $X$ are independent and $U_i \sim N(0, 1)$, $i = 1, 2$. $U_1 - X$ and $U_2 - X$
have a joint bivariate normal distribution with mean vector $(-\xi, -\xi)$ and covariance
matrix

$$V = \begin{pmatrix} 1 + \sigma^2 & \sigma^2 \\ \sigma^2 & 1 + \sigma^2 \end{pmatrix}.$$

Hence,

$$E\{Y^2\} = \Phi_2\left(\frac{\xi}{(1 + \sigma^2)^{1/2}}, \frac{\xi}{(1 + \sigma^2)^{1/2}};\ \frac{\sigma^2}{1 + \sigma^2}\right).$$

Finally,

$$V\{Y\} = E\{Y^2\} - \Phi^2\left(\frac{\xi}{(1 + \sigma^2)^{1/2}}\right).$$

Generally, the $n$th moment of $Y$ can be determined by the $n$-variate multinormal
c.d.f. $\Phi_n\left(\frac{\xi}{(1 + \sigma^2)^{1/2}}, \ldots, \frac{\xi}{(1 + \sigma^2)^{1/2}};\ R\right)$, where the correlation matrix $R$ has
off-diagonal elements $R_{ij} = \sigma^2/(1 + \sigma^2)$, for all $i \neq j$. We do not treat here the
problem of computing the standard $k$-variate multinormal c.d.f. Computer routines
are available for small values of $k$. The problem of the numerical evaluation is
generally difficult. Tables are available for the bivariate and the trivariate cases. For
further comments on this issue see Johnson and Kotz (1972, pp. 83-132).


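Example 2.8. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(0, 1)$ r.v.s. The sample variance is
defined as

$$S^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (X_i - \bar{X})^2, \quad \text{where } \bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j.$$

Let $Q = \sum (X_i - \bar{X})^2$. Define the matrix $J = \mathbf{1}\mathbf{1}'$, where $\mathbf{1}' = (1, \ldots, 1)$ is a vector
of ones. Let $A = I - \frac{1}{n}J$, so that $Q = X'AX$. It is easy to verify that $A$ is an idempotent
matrix. Indeed, since $J^2 = nJ$,

$$\left(I - \frac{1}{n}J\right)^2 = I - \frac{2}{n}J + \frac{1}{n^2}J^2 = I - \frac{1}{n}J.$$

The rank of $A$ is $r = n - 1$. Thus, we obtained that $S^2 \sim \frac{1}{n - 1}\chi^2[n - 1]$.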


Example 2.9. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $N(\xi, \sigma^2)$ distribution.
The sample mean is $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and the sample variance is
$S^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (X_i - \bar{X})^2$. In Section 2.5 we showed that if the distribution of the $X$s
is symmetric, then $\bar{X}$ and $S^2$ are uncorrelated. We prove here the stronger result that,
in the normal case, $\bar{X}$ and $S^2$ are independent. Indeed,

$$\bar{X} \sim \xi + \frac{\sigma}{n}\mathbf{1}'U, \quad \text{where } U = (U_1, \ldots, U_n)'$$

is distributed like $N(0, I)$. Moreover, $S^2 \sim \frac{\sigma^2}{n - 1}\, U'\left(I - \frac{1}{n}J\right)U$. But,

$$\mathbf{1}'\left(I - \frac{1}{n}J\right) = \mathbf{0}'.$$

This implies the independence of $\bar{X}$ and $S^2$.


Example 2.10. Let $X$ be a $k$-dimensional random vector having a multinormal
distribution $N(A\beta, \sigma^2 I)$, where $A$ is a $k \times r$ matrix of constants, $\beta$ is an $r \times 1$ vector;
$1 \leq r \leq k$, $0 < \sigma^2 < \infty$. We further assume that rank$(A) = r$, and the parameter
vector $\beta$ is unknown. Consider the vector $\hat{\beta}$ that minimizes the squared-norm
$\|X - A\beta\|^2$, where $\|X\|^2 = \sum X_i^2$. Such a vector $\hat{\beta}$ is called the least-squares estimate of
$\beta$. The vector $\hat{\beta}$ is determined so that

$$\|X\|^2 = \|X - A\hat{\beta}\|^2 + \|A\hat{\beta}\|^2 \equiv Q_1 + Q_2.$$

That is, $A\hat{\beta}$ is the orthogonal projection of $X$ on the subspace generated by the
column vectors of $A$. Thus, the inner product of $(X - A\hat{\beta})$ and $A\hat{\beta}$ should be zero.
This implies that

$$\hat{\beta} = (A'A)^{-1}A'X.$$

The matrix $A'A$ is nonsingular, since $A$ is of full rank. Substituting $\hat{\beta}$ in the expressions
for $Q_1$ and $Q_2$, we obtain

$$Q_1 = \|X - A\hat{\beta}\|^2 = X'(I - A(A'A)^{-1}A')X,$$

and

$$Q_2 = \|A\hat{\beta}\|^2 = X'A(A'A)^{-1}A'X.$$

We prove now that these quadratic forms are independent. Both

$$B_1 = I - A(A'A)^{-1}A' \quad \text{and} \quad B_2 = A(A'A)^{-1}A'$$

are idempotent. The rank of $B_1$ is $k - r$ and that of $B_2$ is $r$. Moreover,

$$B_1 B_2 = (I - A(A'A)^{-1}A')A(A'A)^{-1}A' = A(A'A)^{-1}A' - A(A'A)^{-1}A' = 0.$$

Thus, the conditions of Theorem 2.9.2 are satisfied and $Q_1$ is independent of $Q_2$.
Moreover, $Q_1 \sim \sigma^2\chi^2[k - r; \lambda_1]$ and $Q_2 \sim \sigma^2\chi^2[r; \lambda_2]$, where

$$\lambda_1 = \frac{1}{2\sigma^2}\beta'A'(I - A(A'A)^{-1}A')A\beta = 0$$

and

$$\lambda_2 = \frac{1}{2\sigma^2}\beta'A'A(A'A)^{-1}A'A\beta = \frac{1}{2\sigma^2}\beta'(A'A)\beta.$$

Example 2.11. Let $X_1, \ldots, X_n$ be i.i.d. random variables from a rectangular $R(0, 1)$
distribution. The density of the $i$th order statistic is then

$$f_i(x) = \frac{1}{B(i, n - i + 1)}\, x^{i - 1}(1 - x)^{n - i},$$

$0 \leq x \leq 1$. The p.d.f. of the sample median, for $n = 2m + 1$, is in this case

$$g_m(x) = \frac{1}{B(m + 1, m + 1)}\, x^m(1 - x)^m, \quad 0 \leq x \leq 1.$$

The p.d.f. of the sample range is the $\beta(n - 1, 2)$ density

$$h_n(r) = \frac{1}{B(n - 1, 2)}\, r^{n - 2}(1 - r), \quad 0 \leq r \leq 1.$$

These results can be applied to test whether a sample of $n$ observations is a realization
of i.i.d. random variables having a specified continuous distribution, $F(x)$, since
$Y = F(X) \sim R(0, 1)$.


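Example 2.12. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a common
exponential distribution $E(\lambda)$, $0 < \lambda < \infty$. Let $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ be the
corresponding order statistic. The density of $X_{(1)}$ is

$$f_{X_{(1)}}(x; \lambda) = \lambda n e^{-\lambda n x}, \quad x \geq 0.$$

The joint density of $X_{(1)}$ and $X_{(2)}$ is

$$f_{X_{(1)}, X_{(2)}}(x, y) = n(n - 1)\lambda^2 e^{-\lambda(x + y)} e^{-\lambda(n - 2)y}, \quad 0 < x < y.$$

Let $U = X_{(2)} - X_{(1)}$. The joint density of $X_{(1)}$ and $U$ is

$$f_{X_{(1)}, U}(x, u) = \lambda n e^{-\lambda n x}\,(n - 1)\lambda e^{-\lambda(n - 1)u}, \quad 0 < x < \infty,$$

and $0 < u < \infty$. Notice that $f_{X_{(1)}, U}(x, u) = f_{X_{(1)}}(x) \cdot f_U(u)$. Thus $X_{(1)}$ and $U$ are
independent, and $U$ is distributed like the minimum of $(n - 1)$ i.i.d. $E(\lambda)$ random
variables. Similarly, by induction on $k = 2, 3, \ldots, n$, if $U_k = X_{(k)} - X_{(k-1)}$
then $X_{(k-1)}$ and $U_k$ are independent and $U_k \sim E(\lambda(n - k + 1))$. Thus, since
$X_{(k)} = X_{(1)} + U_2 + \cdots + U_k$,

$$E\{X_{(k)}\} = \frac{1}{\lambda}\sum_{j=n-k+1}^{n} \frac{1}{j} \quad \text{and} \quad V\{X_{(k)}\} = \frac{1}{\lambda^2}\sum_{j=n-k+1}^{n} \frac{1}{j^2},$$

for all $k \geq 1$.

Example 2.13. Let $X \sim N(\mu, \sigma^2)$, $-\infty < \mu < \infty$, $0 < \sigma^2 < \infty$. The p.d.f. of
$X$ is

$$f(x; \mu, \sigma^2) = \frac{1}{(2\pi)^{1/2}\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\} = \frac{1}{(2\pi)^{1/2}\sigma}\exp\left\{-\frac{\mu^2}{2\sigma^2}\right\} \cdot \exp\left\{\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2\right\}.$$

Let $h(x) = \frac{1}{\sqrt{\pi}}$, $C(\mu, \sigma^2) = \frac{1}{\sqrt{2}\,\sigma}\exp\left\{-\frac{\mu^2}{2\sigma^2}\right\}$, $\psi_1(\mu, \sigma^2) = \frac{\mu}{\sigma^2}$,
$\psi_2(\mu, \sigma^2) = -\frac{1}{2\sigma^2}$, $U_1(x) = x$, and $U_2(x) = x^2$. We can write $f(x; \mu, \sigma^2)$
as a two-parameter exponential type family. By making the reparametrization
$(\mu, \sigma^2) \to (\psi_1, \psi_2)$, the parameter space $\Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty,
0 < \sigma^2 < \infty\}$ is transformed to the parameter space

$$\Omega = \{(\psi_1, \psi_2) : -\infty < \psi_1 < \infty, -\infty < \psi_2 < 0\}.$$

In terms of $(\psi_1, \psi_2)$ the density of $X$ can be written as

$$f(x; \psi_1, \psi_2) = h(x) A(\psi_1, \psi_2)\exp\{\psi_1 x + \psi_2 x^2\}, \quad -\infty < x < \infty,$$

where $h(x) = 1/\sqrt{\pi}$ and

$$A(\psi_1, \psi_2) = \exp\left\{\frac{\psi_1^2}{4\psi_2} + \frac{1}{2}\log(-\psi_2)\right\}.$$

The p.d.f. of the standard normal distribution is obtained by substituting $\psi_1 = 0$,
$\psi_2 = -\frac{1}{2}$.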
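
The reparametrization of Example 2.13 can be checked numerically by comparing the canonical form with the usual normal density. The sketch below is only an illustration, and the function names are hypothetical.

```python
import math


def canonical_params(mu, sigma2):
    """Map (mu, sigma^2) to (psi1, psi2) = (mu / sigma^2, -1 / (2 sigma^2))."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)


def canonical_density(x, psi1, psi2):
    """h(x) * A(psi1, psi2) * exp{psi1*x + psi2*x^2}, with h(x) = 1/sqrt(pi)."""
    h = 1.0 / math.sqrt(math.pi)
    a = math.exp(psi1 ** 2 / (4.0 * psi2) + 0.5 * math.log(-psi2))
    return h * a * math.exp(psi1 * x + psi2 * x * x)


def normal_density(x, mu, sigma2):
    """The N(mu, sigma^2) p.d.f. in its usual parametrization."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


psi1, psi2 = canonical_params(1.5, 4.0)
for x in (-2.0, 0.0, 3.0):
    print(x, canonical_density(x, psi1, psi2), normal_density(x, 1.5, 4.0))
```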


Example 2.14. A simple example of a curved exponential family is

$$\mathcal{F} = \{N(\xi, c\xi^2), -\infty < \xi < \infty, c > 0 \text{ known}\}.$$

In this case,

$$f(x; \psi_1, \psi_2) = \frac{1}{\sqrt{\pi}} A(\psi_1, \psi_2)\exp\{\psi_1 x + \psi_2 x^2\},$$

with $\psi_2 = -\frac{c}{2}\psi_1^2$. $\psi_1$ and $\psi_2$ are linearly independent. The rank is $k = 2$ but

$$\Omega^* = \left\{(\psi_1, \psi_2) : \psi_2 = -\frac{c}{2}\psi_1^2, -\infty < \psi_1 < \infty\right\}.$$

The dimension of $\Omega^*$ is 1.
The following example shows a more interesting case of a regular exponential
family of order $k = 3$.


Example 2.15. We consider here a model that is well known as the Model II of
Analysis of Variance. This model will be discussed later in relation to the problem
of estimation and testing variance components.

We are given $n \cdot k$ observations on random variables $X_{ij}$ ($i = 1, \ldots, k$; $j =
1, \ldots, n$). These random variables represent the results of an experiment performed
in $k$ blocks, each block containing $n$ trials. In addition to the random component
representing the experimental error, which affects the observations independently, there
is also a random effect of the blocks. This block effect is the same on all the observations
within a block, but is independent from one block to another. Accordingly, our
model is

$$X_{ij} \sim \mu + a_i + e_{ij}, \quad i = 1, \ldots, k, \quad j = 1, \ldots, n,$$

where $e_{ij}$ are i.i.d. like $N(0, \sigma^2)$ and $a_i$ are i.i.d. like $N(0, \tau^2)$.
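We determine now the joint p.d.f. of the vector $X = (X_{11}, \ldots,
X_{1n}, X_{21}, \ldots, X_{2n}, \ldots, X_{k1}, \ldots, X_{kn})'$. The conditional distribution of $X$ given
$a = (a_1, \ldots, a_k)'$ is the multinormal $N(\mu\mathbf{1}_{nk} + \xi(a), \sigma^2 I_{nk})$, where $\xi'(a) =
(a_1\mathbf{1}_n', a_2\mathbf{1}_n', \ldots, a_k\mathbf{1}_n')$. Hence, the marginal distribution of $X$ is the multinormal
$N(\mu\mathbf{1}_{nk}, V)$, where the covariance matrix $V$ is given by a matrix composed of $k$ equal
submatrices along the main diagonal and zeros elsewhere. That is, if $J_n = \mathbf{1}_n\mathbf{1}_n'$ is an
$n \times n$ matrix of 1s,

$$V = \mathrm{diag}\{\sigma^2 I_n + \tau^2 J_n, \ldots, \sigma^2 I_n + \tau^2 J_n\}.$$

The determinant of $V$ is $(\sigma^2)^{kn}|I_n + \rho J_n|^k$, where $\rho = \tau^2/\sigma^2$. Moreover, let $H$ be an
orthogonal matrix whose first row vector is $\frac{1}{\sqrt{n}}\mathbf{1}_n'$. Then,

$$|I_n + \rho J_n| = |H(I_n + \rho J_n)H'| = (1 + n\rho).$$

Hence, $|V| = \sigma^{2nk}(1 + n\rho)^k$. The inverse of $V$ is

$$V^{-1} = \mathrm{diag}\left\{\frac{1}{\sigma^2}(I_n + \rho J_n)^{-1}, \ldots, \frac{1}{\sigma^2}(I_n + \rho J_n)^{-1}\right\},$$

where $(I_n + \rho J_n)^{-1} = I_n - (\rho/(1 + n\rho))J_n$.

Accordingly, the joint p.d.f. of $X$ is

$$f(x; \mu, \sigma^2, \tau^2) = \frac{1}{(2\pi)^{kn/2}|V|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu\mathbf{1}_{nk})'V^{-1}(x - \mu\mathbf{1}_{nk})\right\}$$

$$= \frac{1}{(2\pi)^{kn/2}\sigma^{kn}(1 + n\rho)^{k/2}}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu\mathbf{1}_{nk})'(x - \mu\mathbf{1}_{nk}) + \frac{\rho}{2\sigma^2(1 + n\rho)}(x - \mu\mathbf{1}_{nk})'\,\mathrm{diag}\{J_n, \ldots, J_n\}(x - \mu\mathbf{1}_{nk})\right\}.$$

Furthermore,

$$(x - \mu\mathbf{1}_{nk})'(x - \mu\mathbf{1}_{nk}) = \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \mu)^2 = \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)^2 + n\sum_{i=1}^{k}(\bar{x}_i - \mu)^2,$$

where $\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}$, $i = 1, \ldots, k$. Similarly,

$$(x - \mu\mathbf{1}_{nk})'\,\mathrm{diag}\{J_n, \ldots, J_n\}(x - \mu\mathbf{1}_{nk}) = n^2\sum_{i=1}^{k}(\bar{x}_i - \mu)^2.$$

Substituting these terms we obtain,

$$f(x; \mu, \sigma^2, \tau^2) = \frac{1}{(2\pi)^{kn/2}\sigma^{nk}(1 + n\rho)^{k/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)^2 - \frac{n}{2\sigma^2(1 + n\rho)}\sum_{i=1}^{k}(\bar{x}_i - \bar{x})^2 - \frac{nk}{2\sigma^2(1 + n\rho)}(\bar{x} - \mu)^2\right\},$$

where $\bar{x} = \frac{1}{k}\sum_{i=1}^{k}\bar{x}_i$.

Define,

$$U_1(x) = \sum_{i=1}^{k}\sum_{j=1}^{n} x_{ij}^2, \quad U_2(x) = \sum_{i=1}^{k}\bar{x}_i^2, \quad U_3(x) = \bar{x},$$

and make the reparametrization

$$\psi_1 = -\frac{1}{2\sigma^2}, \quad \psi_2 = \frac{n^2\rho}{2\sigma^2(1 + n\rho)}, \quad \psi_3 = \frac{nk\mu}{\sigma^2(1 + n\rho)}.$$

The joint p.d.f. of $X$ can be expressed then as

$$f(x; \psi_1, \psi_2, \psi_3) = \exp\{-nK(\psi)\} \cdot \exp\{\psi_1 U_1(x) + \psi_2 U_2(x) + \psi_3 U_3(x)\}.$$

The functions $U_1(x)$, $U_2(x)$, and $U_3(x)$ as well as $\psi_1(\theta)$, $\psi_2(\theta)$, and $\psi_3(\theta)$ are linearly
independent. Hence, the order is $k = 3$, and the dimension of $\Omega^*$ is $d = 3$.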


Example 2.16. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common gamma
distribution $G(\lambda, \nu)$, $0 < \lambda, \nu < \infty$. For this distribution $\beta_1 = 2/\sqrt{\nu}$ and $\beta_2 = 3 + 6/\nu$.
The sample mean $\bar{X}_n$ is distributed like $\frac{1}{n}G(\lambda, n\nu)$. The standardized mean is
$W_n = \sqrt{n}\,\frac{\lambda\bar{X}_n - \nu}{\sqrt{\nu}}$. The exact c.d.f. of $W_n$ is

$$P\{W_n \leq x\} = P\{G(1, n\nu) \leq n\nu + x\sqrt{n\nu}\}.$$

On the other hand, the Edgeworth approximation is

$$P\{W_n \leq x\} \cong \Phi(x) - \frac{1}{3\sqrt{n\nu}}(x^2 - 1)\phi(x) - \frac{x}{n}\left[\frac{1}{4\nu}(x^2 - 3) + \frac{1}{18\nu}(x^4 - 10x^2 + 15)\right]\phi(x).$$

In the following table, we compare the exact distribution of $W_n$ with its Edgeworth
expansion for the case of $\nu = 1$, $n = 10$, and $n = 20$. We see that for $n = 20$ the
Edgeworth expansion yields a very good approximation, with a maximal relative
error of $-4.5\%$ at $x = -2$. At $x = -1.5$ the relative error is $0.9\%$. At all other values
of $x$ the relative error is much smaller.

Exact Distribution and Edgeworth Approximation

               n = 10                   n = 20
   x       Exact     Edgeworth      Exact     Edgeworth
 -2.0    0.004635    0.001627     0.009767    0.009328
 -1.5    0.042115    0.045289     0.051139    0.051603
 -1.0    0.153437    0.160672     0.156254    0.156638
 -0.5    0.336526    0.342605     0.328299    0.328311
  0      0.542070    0.542052     0.529743    0.529735
  0.5    0.719103    0.713061     0.711048    0.711052
  1.0    0.844642    0.839328     0.843086    0.843361
  1.5    0.921395    0.920579     0.923890    0.924262
  2.0    0.963145    0.964226     0.966650    0.966527


Example 2.17. Let $X_1, \ldots, X_n$ be i.i.d. distributed as $G(\lambda, \nu)$. $\bar{X}_n \sim \frac{1}{n}G(\lambda, n\nu)$.
Accordingly,

$$f_{\bar{X}_n}(x) = \frac{(n\lambda)^{n\nu}}{\Gamma(n\nu)}\, x^{n\nu - 1} e^{-n\lambda x}, \quad x > 0.$$

The cumulant generating function of $G(\lambda, \nu)$ is

$$K(\psi) = -\nu\log\left(1 - \frac{\psi}{\lambda}\right), \quad \psi < \lambda.$$

Thus,

$$K'(\psi) = \frac{\nu}{\lambda - \psi}, \quad \psi < \lambda,$$

and

$$K''(\psi) = \frac{\nu}{(\lambda - \psi)^2}, \quad \psi < \lambda.$$

Accordingly, $\hat{\psi} = \lambda - \nu/x$ and

$$K''(\hat{\psi}) = x^2/\nu,$$

$\exp\{n[K(\hat{\psi}) - \hat{\psi}x]\} = \exp\{n\nu - n\lambda x\} \cdot \left(\frac{\lambda x}{\nu}\right)^{n\nu}$. It follows from Equation
(2.15.18) that the saddlepoint approximation is

$$f_{\bar{X}}(x) \cong \frac{\sqrt{n\nu}}{\sqrt{2\pi}\,\nu^{n\nu}}\,\lambda^{n\nu} e^{n\nu}\, x^{n\nu - 1} e^{-n\lambda x}.$$

If we substitute in the exact formula the Stirling approximation, $\Gamma(n\nu) \cong
\sqrt{2\pi}\,(n\nu)^{n\nu - \frac{1}{2}} e^{-n\nu}$, we obtain the saddlepoint approximation.

Section 2.2


2.2.1 Consider the binomial distribution with parameters $n$, $\theta$, $0 < \theta < 1$.
Write an algorithm for the computation of $b(j \mid n, \theta)$ employing the
recursive relationship

$$b(j; n, \theta) = \begin{cases} (1 - \theta)^n, & j = 0, \\ R_j(n, \theta)\,b(j - 1; n, \theta), & j = 1, \ldots, n, \end{cases}$$

where $R_j(n, \theta) = b(j; n, \theta)/b(j - 1; n, \theta)$. Write the ratio $R_j(n, \theta)$ explicitly
and find an expression for the mode of the distribution, i.e., $x^0$ = smallest
nonnegative integer for which $b(x^0; n, \theta) \geq b(j; n, \theta)$ for all $j = 0, \ldots, n$.


2.2.2 Prove formula (2.2.2).


2.2.3 Determine the median of the binomial distribution with $n = 15$ and $\theta = .75$.


2.2.4 Prove that when $n \to \infty$, $\theta \to 0$, but $n\theta \to \lambda$, $0 < \lambda < \infty$, then

$$\lim_{\substack{n \to \infty \\ n\theta \to \lambda}} b(i; n, \theta) = p(i; \lambda), \quad i = 0, 1, \ldots$$

where $p(i; \lambda)$ is the p.d.f. of the Poisson distribution.


2.2.5 Establish formula (2.2.7).


2.2.6 Let $X$ have the Pascal distribution with parameters $\nu$ (fixed positive integer)
and $\theta$, $0 < \theta < 1$. Employ the relationship between the Pascal distribution
and the negative-binomial distribution to show that the median of $X$ is
$\nu + n_{.5}$, where $n_{.5}$ = least nonnegative integer $n$ such that $I_\theta(\nu, n + 1) \geq .5$.
[This formula of the median is useful for writing a computer program and
utilizing the computer's library subroutine function that computes $I_\theta(a, b)$.]


2.2.7 Apply formula (2.2.4) to prove the binomial c.d.f. $B(j; n, \theta)$ is a decreasing
function of $\theta$, for each $j = 0, 1, \ldots, n$.


2.2.8 Apply formula (2.2.12) to prove that the c.d.f. of the negative-binomial
distribution, $NB(\psi, \nu)$, is strictly decreasing in $\psi$, for a fixed $\nu$, for each
$j = 0, 1, \ldots$.


2.2.9 Let $X \sim B(10^5, .0003)$. Apply the Poisson approximation to compute
$P\{20 < X < 40\}$.

Section 2.3


2.3.1 Let $U$ be a random variable having a rectangular distribution $R(0, 1)$. Let
$\beta^{-1}(p \mid a, b)$, $0 < p < 1$, $0 < a, b < \infty$ denote the $p$th quantile of the
$\beta(a, b)$ distribution. What is the distribution of $Y = \beta^{-1}(U; a, b)$?


2.3.2 Let $X$ have a gamma distribution $G\left(\frac{1}{\beta}, k\right)$, $0 < \beta < \infty$, and $k$ be a positive
integer. Let $\chi_p^2[\nu]$ denote the $p$th quantile of the chi-squared distribution with
$\nu$ degrees of freedom. Express the $p$th quantile of $G\left(\frac{1}{\beta}, k\right)$ in terms of the
corresponding quantiles of the chi-squared distributions.


2.3.3 Let $Y$ have the extreme value distribution (2.3.19). Derive formulae for the
$p$th quantile of $Y$ and for its interquartile range.


2.3.4 Let $n(x; \xi, \sigma^2)$ denote the p.d.f. of the normal distribution $N(\xi, \sigma^2)$. Prove
that

$$\int_{-\infty}^{\infty} n(x; \xi, \sigma^2)\,dx = 1,$$

for all $(\xi, \sigma^2)$; $-\infty < \xi < \infty$, $0 < \sigma^2 < \infty$.


2.3.5 Let $X$ have the binomial distribution with $n = 10^5$ and $\theta = 10^{-3}$. For large
values of $\lambda$ ($\lambda > 30$), the $N(\lambda, \lambda)$ distribution provides a good approximation
to the c.d.f. of the Poisson distribution $P(\lambda)$. Apply this property to
approximate the probability $P\{90 < X < 110\}$.


2.3.6 Let $X$ have an exponential distribution $E(\lambda)$, $0 < \lambda < \infty$. Prove that for all
$t > 0$, $E\{\exp\{-tX\}\} \geq \exp\{-t/\lambda\}$.


2.3.7 Let $X \sim R(0, 1)$ and $Y = -\log X$.
(i) Show that $E\{Y\} \geq \log 2$. [The logarithm is on the $e$ base.]
(ii) Derive the distribution of $Y$ and find $E\{Y\}$ exactly.


2.3.8 Determine the first four cumulants of the gamma distribution $G(\lambda, \nu)$, $0 < \lambda$,
$\nu < \infty$. What are the coefficients of skewness and kurtosis?


2.3.9 Derive the coefficients of skewness and kurtosis of the log-normal distribution
$LN(\mu, \sigma)$.


2.3.10 Derive the coefficients of skewness and kurtosis of the beta distribution
$\beta(p, q)$.


Section 2.4 
2.4.1 Let X and Y be independent random variables and P{Y > 0} = 1. Assume 
1 
also that E{|X|} < oo and E {>| < oo. Apply the Jensen inequality and 


the law of the iterated expectation to prove that 


E {>} > E{X}/E(Y), if E(X) > 0, 


< E{X}/E{Y}, if E{X} <0. 


2.4.2 Prove that if X and Y are positive random variables and E{Y | X} = bX, 
0 < b < ~, then 


(i) E{Y/X} =b, 
(ii) E{X/Y} > 1/b. 


2.4.3 Let X and Y be independent random variables. Show that cov(X + Y, X — 
Y) = Oif and only if V{x} = V{Y}. 


2.4.4 Let X and Y be independent random variables having a common normal 
distribution N(0, 1). Find the distribution of R = X/Y. Does E{R} exist? 


2.4.5 Let X and Y be independent random variables having a common log-normal 
distribution LN (1, 07), ie., log X ~ logY ~ N(u, 0”). 
(i) Prove that XY ~ LN(2p, 207). 
(ii) Show that E{XY} = exp{2u + 07}. 
(iii) What is E{X/Y}? 


2.4.6 Let X have a binomial distribution B(n, 8) and let U ~ R(O, 1) indepen- 
dently of X and Y = X+U. 


(i) Show that Y has an absolutely continuous distribution with c.d_f. 


0, ifn <0 

Fine (7 — f)BU3n, @)+ ifj<n<Jjtl, 

WY) d—-7t+pBU—-lin,d), jf =0,1,...,n° 
1, ify7>n+l 


(ii) What are $E\{Y\}$ and $V\{Y\}$?


2.4.7 Suppose that the conditional distribution of $X$ given $\theta$ is the binomial $B(n, \theta)$. Furthermore, assume that $\theta$ is a random variable having a beta distribution $\beta(p, q)$.

(i) What is the marginal p.d.f. of $X$?

(ii) What is the conditional p.d.f. of $\theta$ given $X = x$?


2.4.8 Prove that if the conditional distribution of $X$ given $\lambda$ is the Poisson $P(\lambda)$, and if $\lambda$ has the gamma distribution $G\left(\frac{1}{\tau}, \nu\right)$, then the marginal distribution of $X$ is the negative-binomial $NB\left(\frac{\tau}{1+\tau}, \nu\right)$.


2.4.9 Let $X$ and $Y$ be independent random variables having a common exponential distribution, $E(\lambda)$. Let $U = X + Y$ and $W = X - Y$.

(i) Prove that the conditional distribution of $W$ given $U$ is the rectangular $R(-U, U)$.

(ii) Prove that the marginal distribution of $W$ is the Laplace distribution, with p.d.f.

$$g(\omega; \lambda) = \frac{\lambda}{2}\exp\{-\lambda|\omega|\}, \quad -\infty < \omega < \infty.$$


2.4.10 Let $X$ have a standard normal distribution $N(0, 1)$. Let $Y = \Phi(X)$. Show that the correlation between $X$ and $Y$ is $\rho = (3/\pi)^{1/2}$. [Although $Y$ is completely determined by $X$, i.e., $V\{Y \mid X\} = 0$ for all $X$, the correlation $\rho$ is less than 1. This is due to the fact that $Y$ is a nonlinear function of $X$.]


2.4.11 Let $X$ and $Y$ be independent standard normal random variables. What is the distribution of the distance of $(X, Y)$ from the origin?


2.4.12 Let $X$ have a $\chi^2[\nu]$ distribution. Let $Y = \delta X$, where $1 < \delta < \infty$. Express the m.g.f. of $Y$ as a weighted average of the m.g.f.s of $\chi^2[\nu + 2j]$, $j = 0, 1, \ldots$, with weights given by $\omega_j = P[J = j]$, where $J \sim NB\left(1 - \frac{1}{\delta}, \frac{\nu}{2}\right)$. [The distribution of $Y = \delta X$ can be considered as an infinite mixture of $\chi^2[\nu + 2J]$ distributions, where $J$ is a random variable having a negative-binomial distribution.]


2.4.13 Let $X$ and $Y$ be independent random variables; $X \sim \chi^2[\nu_1]$ and $Y \sim \delta\chi^2[\nu_2]$, $1 < \delta < \infty$. Use the result of the previous exercise to prove that $X + Y \sim \chi^2[\nu_1 + \nu_2 + 2J]$, where $J \sim NB\left(1 - \frac{1}{\delta}, \frac{\nu_2}{2}\right)$. [Hint: Multiply the m.g.f.s of $X$ and $Y$ or consider the conditional distribution of $X + Y$ given $J$, where $J$ is independent of $X$ and $Y \mid J \sim \chi^2[\nu_2 + 2J]$, $J \sim NB\left(1 - \frac{1}{\delta}, \frac{\nu_2}{2}\right)$.]


2.4.14 Let $X_1, \ldots, X_n$ be identically distributed independent random variables having a continuous distribution $F$ symmetric around $\mu$. Let $M(X_1, \ldots, X_n)$ and $Q(X_1, \ldots, X_n)$ be two functions of $\mathbf{X} = (X_1, \ldots, X_n)'$ satisfying:

(i) $E\{M(X_1, \ldots, X_n)\} = \mu$;

(ii) $E\{M^2(X_1, \ldots, X_n)\} < \infty$, $E\{Q^2(X_1, \ldots, X_n)\} < \infty$;

(iii) $M(-X_1, \ldots, -X_n) = -M(X_1, \ldots, X_n)$;

(iv) $M(X_1 + c, \ldots, X_n + c) = c + M(X_1, \ldots, X_n)$ for all constants $c$, $-\infty < c < \infty$;

(v) $Q(X_1 + c, \ldots, X_n + c) = Q(X_1, \ldots, X_n)$, all constants $c$, $-\infty < c < \infty$;

(vi) $Q(-X_1, \ldots, -X_n) = Q(X_1, \ldots, X_n)$;

then $\mathrm{cov}(M(X_1, \ldots, X_n), Q(X_1, \ldots, X_n)) = 0$.


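2.4.15 Let $Y_1, \ldots, Y_{k-1}$ ($k \ge 3$), $Y_i \ge 0$, $\sum_{i=1}^{k-1} Y_i \le 1$, have a joint Dirichlet distribution $D(\nu_1, \nu_2, \ldots, \nu_k)$, $0 < \nu_i < \infty$, $i = 1, \ldots, k$.

(i) Find the correlation between $Y_i$ and $Y_{i'}$, $i \ne i'$.
(ii) Let $1 < m < k$. Show that

$$(Y_1, \ldots, Y_{m-1}) \sim D\Big(\nu_1, \ldots, \nu_{m-1}, \sum_{j=m}^{k} \nu_j\Big).$$

(iii) For any $1 < m < k$, show the conditional law

$$(Y_1, \ldots, Y_{m-1} \mid Y_m, \ldots, Y_{k-1}) \sim \Big(1 - \sum_{j=m}^{k-1} Y_j\Big) D(\nu_1, \ldots, \nu_{m-1}, \nu_k).$$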


Section 2.5 


2.5.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables. $\hat{\mu}_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r$ is the sample moment of order $r$. Assuming that all the required moments of the common distribution of $X_i$ ($i = 1, \ldots, n$) exist, find

(i) $E\{\hat{\mu}_r\}$;
(ii) $V\{\hat{\mu}_r\}$;
(iii) $\mathrm{cov}(\hat{\mu}_{r_1}, \hat{\mu}_{r_2})$, for $1 \le r_1 < r_2$.


2.5.2 Let $U_j$, $j = 0, \pm 1, \pm 2, \ldots$, be a sequence of independent random variables, such that $E\{U_j\} = 0$ and $V\{U_j\} = \sigma^2$ for all $j$. Define the random variables $X_t = \beta_0 + \beta_1 U_{t-1} + \beta_2 U_{t-2} + \cdots + \beta_p U_{t-p} + U_t$, $t = 0, 1, 2, \ldots$, where $\beta_0, \ldots, \beta_p$ are fixed constants. Derive

(i) $E\{X_t\}$, $t = 0, 1, \ldots$;

(ii) $V\{X_t\}$, $t = 0, 1, \ldots$;

(iii) $\mathrm{cov}(X_t, X_{t+h})$, $t = 0, 1, \ldots$; $h$ is a fixed positive integer.

[The sequence $\{X_t\}$ is called a moving-average time series of order $p$. Notice that $E\{X_t\}$, $V\{X_t\}$, and $\mathrm{cov}(X_t, X_{t+h})$ do not depend on $t$. Such a series is therefore called covariance stationary.]


2.5.3 Let $X_1, \ldots, X_n$ be random variables represented by the model $X_i = \mu_i + e_i$ ($i = 1, \ldots, n$), where $e_1, \ldots, e_n$ are independent random variables, $E\{e_i\} = 0$ and $V\{e_i\} = \sigma^2$ for all $i = 1, \ldots, n$. Furthermore, let $\mu_1$ be a constant and $\mu_i = \mu_{i-1} + J_i$ ($i = 2, \ldots, n$), where $J_2, \ldots, J_n$ are independent random variables, $J_i \sim B(1, p)$, $i = 2, \ldots, n$. Let $\mathbf{X} = (X_1, \ldots, X_n)'$. Determine
(i) $E\{\mathbf{X}\}$,
(ii) $\Sigma(\mathbf{X})$. [The covariance matrix]


2.5.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables. Assume that all required moments exist. Find

(i) $E\{\bar{X}^4\}$.
(ii) $E\{\bar{X}^5\}$.
(iii) $E\{(\bar{X} - \mu)^6\}$.


Section 2.6 
2.6.1 Let $(X_1, X_2, X_3)$ have the trinomial distribution with parameters $n = 20$, $\theta_1 = .3$, $\theta_2 = .6$, $\theta_3 = .1$.
(i) Determine the joint p.d.f. of $(X_2, X_3)$.
(ii) Determine the conditional p.d.f. of $X_1$ given $X_2 = 5$, $X_3 = 7$.


2.6.2 Let $(X_1, \ldots, X_n)$ have a conditional multinomial distribution given $N$ with parameters $N, \theta_1, \ldots, \theta_n$. Assume that $N$ has a Poisson distribution $P(\lambda)$.

(i) Find the (non-conditional) joint distribution of $(X_1, \ldots, X_n)$.
(ii) What is the correlation of $X_1$ and $X_2$?


2.6.3 Let $(X_1, X_2)$ have the bivariate negative-binomial distribution $NB(\theta_1, \theta_2, \nu)$.
(i) Determine the correlation coefficient $\rho_{X_1, X_2}$.
(ii) Determine the conditional expectation $E\{X_1 \mid X_2\}$ and the conditional variance $V\{X_1 \mid X_2\}$.
(iii) Compute the coefficient of determination $D^2 = 1 - E\{V\{X_1 \mid X_2\}\}/V\{X_1\}$ and compare $D^2$ to $\rho^2_{X_1, X_2}$.


2.6.4 Suppose that $(X_1, \ldots, X_k)$ has a $k$-variate hypergeometric distribution $H(N, M_1, \ldots, M_k, n)$. Determine the expected value and variance of $Y = \sum_{j=1}^{k} \beta_j X_j$, where $\beta_j$ ($j = 1, \ldots, k$) are arbitrary constants.


2.6.5 Suppose that $\mathbf{X}$ has a $k$-variate hypergeometric distribution $H(N, M_1, \ldots, M_k, n)$. Furthermore, assume that $(M_1, \ldots, M_k)$ has a multinomial distribution with parameters $N$ and $(\theta_1, \ldots, \theta_k)$. Derive the (marginal, or expected) distribution of $\mathbf{X}$.


Section 2.7 


2.7.1 Let $(X, Y)$ have a bivariate normal distribution with mean vector $(\xi, \eta)$ and covariance matrix

$$\Sigma = \begin{pmatrix} \sigma^2 & \rho\sigma\tau \\ \rho\sigma\tau & \tau^2 \end{pmatrix}.$$

Make a linear transformation $(X, Y) \to (U, W)$ such that $(U, W)$ are independent $N(0, 1)$.


2.7.2 Let $(X_1, X_2)$ have a bivariate normal distribution $N(\boldsymbol{\xi}, \Sigma)$. Define $Y = \alpha X_1 + \beta X_2$.
(i) Derive the formula of $E\{\Phi(Y)\}$.
(ii) What is $V\{\Phi(Y)\}$?


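2.7.3 The following is a normal regression model discussed, more generally, in Chapter 5. $(x_1, Y_1), \ldots, (x_n, Y_n)$ are $n$ pairs in which $x_1, \ldots, x_n$ are preassigned constants and $Y_1, \ldots, Y_n$ independent random variables. According to the model, $Y_i \sim N(\alpha + \beta x_i, \sigma^2)$, $i = 1, \ldots, n$. Let $\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$ and

$$\hat{\beta}_n = \sum_{i=1}^{n} (x_i - \bar{x}_n)Y_i \Big/ \sum_{i=1}^{n} (x_i - \bar{x}_n)^2, \qquad \hat{\alpha}_n = \bar{Y}_n - \hat{\beta}_n \bar{x}_n.$$

$\hat{\alpha}_n$ and $\hat{\beta}_n$ are called the least-squares estimates of $\alpha$ and $\beta$, respectively. Derive the joint distribution of $(\hat{\alpha}_n, \hat{\beta}_n)$. [We assume that $\sum (x_i - \bar{x})^2 > 0$.]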


2.7.4 Suppose that $\mathbf{X}$ is an $m$-dimensional random vector and $\mathbf{Y}$ is an $r$-dimensional one, $1 \le r \le m$. Furthermore, assume that the conditional distribution of $\mathbf{X}$ given $\mathbf{Y}$ is $N(A\mathbf{Y}, \Sigma)$ where $A$ is an $m \times r$ matrix of constants. In addition, let $\mathbf{Y} \sim N(\boldsymbol{\eta}, D)$.

(i) Determine the (marginal) joint distribution of $\mathbf{X}$;

(ii) Determine the conditional distribution of $\mathbf{Y}$ given $\mathbf{X}$.


2.7.5 Let $(Z_1, Z_2)$ have a standard bivariate normal distribution with coefficient of correlation $\rho$. Suppose that $Z_1$ and $Z_2$ are unobservable and that the observable random variables are

$$X_i = \begin{cases} 0, & \text{if } Z_i \le 0 \\ 1, & \text{if } Z_i > 0, \end{cases} \qquad i = 1, 2.$$

Let $\tau$ be the coefficient of correlation between $X_1$ and $X_2$. Prove that $\rho = \sin(\pi\tau/2)$. [$\tau$ is called the tetrachoric (four-entry) correlation.]


2.7.6 Let $(Z_1, Z_2)$ have a standard bivariate normal distribution with coefficient of correlation $\rho = 1/2$. Prove that $P\{Z_1 \ge 0, Z_2 \le 0\} = 1/6$.


2.7.7 Let $(Z_1, Z_2, Z_3)$ have a standard trivariate normal distribution with a correlation matrix

$$R = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{pmatrix}.$$

Consider the linear regression of $Z_1$ on $Z_3$, namely $\hat{Z}_1 = \rho_{13}Z_3$, and the linear regression of $Z_2$ on $Z_3$, namely $\hat{Z}_2 = \rho_{23}Z_3$. Show that the correlation between $Z_1 - \hat{Z}_1$ and $Z_2 - \hat{Z}_2$ is the partial correlation $\rho_{12 \cdot 3}$ given by (2.7.15).


2.7.8 Let $(Z_1, Z_2)$ have a standard bivariate normal distribution with coefficient of correlation $\rho$. Let $\mu_{rs} = E\{Z_1^r Z_2^s\}$ denote the mixed moment of order $(r, s)$. Show that
(i) $\mu_{12} = \mu_{21} = 0$;
(ii) $\mu_{13} = \mu_{31} = 3\rho$;
(iii) $\mu_{22} = 1 + 2\rho^2$;
(iv) $\mu_{14} = \mu_{41} = 0$;
(v) $\mu_{15} = \mu_{51} = 15\rho$;
(vi) $\mu_{24} = \mu_{42} = 3(1 + 4\rho^2)$;
(vii) $\mu_{33} = 3\rho(3 + 2\rho^2)$.


2.7.9 A commonly tabulated function for the standard bivariate normal distribution is the upper orthant probability

$$L(h, k \mid \rho) = P\{h \le Z_1, k \le Z_2\} = \frac{1}{2\pi\sqrt{1-\rho^2}} \int_h^\infty \int_k^\infty \exp\left\{-\frac{1}{2(1-\rho^2)}(z_1^2 - 2\rho z_1 z_2 + z_2^2)\right\} dz_2\, dz_1.$$

Show that
(i) $\Phi_2(h, k; \rho) = 1 - L(h, -\infty \mid \rho) - L(-\infty, k \mid \rho) + L(h, k \mid \rho)$;
(ii) $L(h, k \mid \rho) = L(k, h \mid \rho)$;
(iii) $L(-h, k \mid \rho) + L(h, k \mid -\rho) = 1 - \Phi(k)$;
(iv) $L(-h, -k \mid \rho) - L(h, k \mid \rho) = \Phi(h) + \Phi(k) - 1$;
(v) $L(0, 0 \mid \rho) = \frac{1}{4} + \frac{1}{2\pi}\sin^{-1}(\rho)$.


2.7.10 Suppose that $\mathbf{X}$ has a multinormal distribution $N(\boldsymbol{\xi}, \Sigma)$. Let $X_{(n)} = \max_{1 \le i \le n}\{X_i\}$. Express the c.d.f. of $X_{(n)}$ in terms of the standard multinormal c.d.f., $\Phi_n(\mathbf{z}; R)$.


2.7.11 Let $\mathbf{Z} = (Z_1, \ldots, Z_m)'$ have a standard multinormal distribution whose correlation matrix $R$ has elements

$$\rho_{ij} = \begin{cases} 1, & \text{if } i = j \\ \lambda_i\lambda_j, & \text{if } i \ne j, \end{cases}$$

where $|\lambda_j| < 1$ ($j = 1, \ldots, m$). Prove that

$$\Phi_m(\mathbf{h}; R) = \int_{-\infty}^{\infty} \phi(u) \prod_{j=1}^{m} \Phi\left(\frac{h_j - \lambda_j u}{\sqrt{1 - \lambda_j^2}}\right) du,$$

where $\phi(u)$ is the standard normal p.d.f. [Hint: Let $U_0, U_1, \ldots, U_m$ be independent $N(0, 1)$ and let $Z_j = \lambda_j U_0 + \sqrt{1 - \lambda_j^2}\, U_j$, $j = 1, \ldots, m$.]


2.7.12 Let $\mathbf{Z}$ have a standard $m$-dimensional multinormal distribution with a correlation matrix $R$ whose off-diagonal elements are $\rho$, $0 < \rho < 1$. Show that

$$\Phi_m(\mathbf{0}; R) = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{-t^2}[1 - \Phi(at)]^m\, dt,$$

where $0 < a < \infty$, $a^2 = 2\rho/(1-\rho)$. In particular, for $\rho = 1/2$, $\Phi_m(\mathbf{0}; R) = (1 + m)^{-1}$.


Section 2.8

2.8.1 Let $X \sim N(0, \sigma^2)$ and $Q = X^2$. Prove that $Q \sim \sigma^2\chi^2[1]$ by deriving the m.g.f. of $Q$.


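2.8.2 Consider the normal regression model (Problem 3, Section 2.7). The sum of squares of deviation around the fitted regression line is

$$Q_{Y|x} = \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta}x_i)^2 = (1 - r^2)\sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$

where $r$ is the sample coefficient of correlation, i.e.,

$$r = \sum_{i=1}^{n} Y_i(x_i - \bar{x}) \Big/ \left[\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (Y_i - \bar{Y})^2\right]^{1/2}.$$

Prove that $Q_{Y|x} \sim \sigma^2\chi^2[n - 2]$.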


2.8.3 Let $\{Y_{ij};\ i = 1, \ldots, I,\ j = 1, \ldots, J\}$ be a set of random variables. Consider the following two models (of ANOVA, discussed in Section 4.6.2).

Model I: $Y_{ij}$ are mutually independent, and for each $i$ ($i = 1, \ldots, I$) $Y_{ij} \sim N(\xi_i, \sigma^2)$ for all $j = 1, \ldots, J$. $\xi_1, \ldots, \xi_I$ are constants.

Model II: For each $i$ ($i = 1, \ldots, I$) the conditional distribution of $Y_{ij}$ given $\xi_i$ is $N(\xi_i, \sigma^2)$ for all $j = 1, \ldots, J$. Furthermore, given $\xi_1, \ldots, \xi_I$, $Y_{ij}$ are conditionally independent. $\xi_1, \ldots, \xi_I$ are independent random variables having the common distribution $N(0, \tau^2)$. Define the quadratic forms


Determine the distributions of Q; and Q> under the two different models. 


2.8.4 Prove that if $X_1$ and $X_2$ are independent and $X_i \sim \chi^2[\nu_i; \lambda_i]$, $i = 1, 2$, then $X_1 + X_2 \sim \chi^2[\nu_1 + \nu_2; \lambda_1 + \lambda_2]$.


Section 2.9 


2.9.1 Consider the statistics $Q_1$ and $Q_2$ of Problem 3, Section 2.8. Check whether they are independent.


2.9.2 Consider Example 2.5. Prove that the least-squares estimator $\hat{\beta}$ is independent of $Q_1$.


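2.9.3 Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent random vectors having a common bivariate normal distribution with $V\{X\} = V\{Y\} = 1$. Let

$$Q_1 = \frac{1}{1-\rho^2}\left\{\sum_{i=1}^{n} (X_i - \bar{X})^2 - 2\rho\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) + \rho^2\sum_{i=1}^{n} (Y_i - \bar{Y})^2\right\},$$

$$Q_2 = \frac{1}{1-\rho^2}(\bar{X} - \rho\bar{Y})^2,$$

where $\rho$, $-1 < \rho < 1$, is the correlation between $X$ and $Y$. Prove that $Q_1$ and $Q_2$ are independent. [Hint: Consider the random variables $U_i = X_i - \rho Y_i$, $i = 1, \ldots, n$.]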


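2.9.4 Let $\mathbf{X}$ be an $n \times 1$ random vector having a multinormal distribution $N(\mu\mathbf{1}, \Sigma)$ where

$$\Sigma = \sigma^2\begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix} = \sigma^2(1-\rho)I + \sigma^2\rho J,$$

$J = \mathbf{1}\mathbf{1}'$. Prove that $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $Q = \sum_{i=1}^{n} (X_i - \bar{X})^2$ are independent and find their distribution. [Hint: Apply the Helmert orthogonal transformation $\mathbf{Y} = H\mathbf{X}$, where $H$ is an $n \times n$ orthogonal matrix with first row vector equal to $\frac{1}{\sqrt{n}}\mathbf{1}'$.]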


Section 2.10 

2.10.1 Let $X_1, \ldots, X_n$ be independent random variables having a common exponential distribution $E(\lambda)$, $0 < \lambda < \infty$. Let $X_{(1)} \le \cdots \le X_{(n)}$ be the order statistics.

(i) Derive the p.d.f. of $X_{(1)}$.
(ii) Derive the p.d.f. of $X_{(n)}$.
(iii) Derive the joint p.d.f. of $(X_{(1)}, X_{(n)})$.
(iv) Derive the formula for the coefficient of correlation between $X_{(1)}$ and $X_{(n)}$.


2.10.2 Let $X_1, \ldots, X_n$ be independent random variables having an identical continuous distribution $F(x)$. Let $X_{(1)} \le \cdots \le X_{(n)}$ be the order statistics. Find the distribution of $U = (F(X_{(n)}) - F(X_{(2)}))/(F(X_{(n)}) - F(X_{(1)}))$.


2.10.3 Derive the p.d.f. of the range $R = X_{(n)} - X_{(1)}$ of a sample of $n = 3$ independent random variables from a common $N(\mu, \sigma^2)$ distribution.


2.10.4 Let $X_1, \ldots, X_n$, where $n = 2m + 1$, be independent random variables having a common rectangular distribution $R(0, \theta)$, $0 < \theta < \infty$. Define the statistics $U = X_{(m)} - X_{(1)}$ and $W = X_{(n)} - X_{(m+1)}$. Find the joint p.d.f. of $(U, W)$ and their coefficient of correlation.


2.10.5 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common continuous distribution symmetric about $x_0 = \mu$. Let $f_{(i)}(x)$ denote the p.d.f. of the $i$th order statistic, $i = 1, \ldots, n$. Show that $f_{(r)}(\mu + x) = f_{(n-r+1)}(\mu - x)$, all $x$, $r = 1, \ldots, n$.


2.10.6 Let $X_{(n)}$ be the maximum of a sample of size $n$ of independent identically distributed random variables having a standard exponential distribution $E(1)$. Show that the c.d.f. of $Y_n = X_{(n)} - \log n$ converges, as $n \to \infty$, to $\exp\{-e^{-x}\}$, which is the extreme-value distribution of Type I (Section 2.3.4). [This result can be generalized to other distributions too. Under some general conditions on the distribution of $X$, the c.d.f. of $X_{(n)} + \log n$ converges to the extreme-value distribution of Type I (Galambos, 1978).]


2.10.7 Suppose that $X_{n,1}, \ldots, X_{n,k}$ are $k$ independent identically distributed random variables having the distribution of the maximum of a random sample of size $n$ from $R(0, 1)$. Let $V = \prod_{i=1}^{k} X_{n,i}$. Show that the p.d.f. of $V$ is (David, 1970, p. 22)

$$g(v) = \frac{n^k}{\Gamma(k)} v^{n-1}(-\log v)^{k-1}, \quad 0 \le v \le 1.$$

Section 2.11 


2.11.1 Let $X \sim t[10]$. Determine the value of the coefficient of kurtosis $\gamma = \mu_4^*/(\mu_2^*)^2$.


2.11.2 Consider the normal regression model (Problem 3, Section 2.7 and Problem 2, Section 2.8). The standard errors of the least-squares estimates are defined as

$$\mathrm{S.E.}\{\hat{\alpha}_n\} = S_{y|x}\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right)^{1/2},$$

$$\mathrm{S.E.}\{\hat{\beta}_n\} = S_{y|x}\Big/\left(\sum (x_i - \bar{x})^2\right)^{1/2},$$

where $S^2_{y|x} = Q_{y|x}/(n - 2)$. What are the distributions of $(\hat{\alpha}_n - \alpha)/\mathrm{S.E.}\{\hat{\alpha}_n\}$ and of $(\hat{\beta}_n - \beta)/\mathrm{S.E.}\{\hat{\beta}_n\}$?


2.11.3 Let $\Phi(u)$ be the standard normal integral and let $X \sim (\chi^2[1])^{1/2}$. Prove that $E\{\Phi(X)\} = 3/4$.


2.11.4 Derive the formulae (2.11.8)—(2.11.10). 


2.11.5 Let $\mathbf{X} \sim N(\mu\mathbf{1}, \Sigma)$, with $\Sigma = \sigma^2(1 - \rho)\left(I + \frac{\rho}{1-\rho}J\right)$, where $-\frac{1}{n-1} < \rho < 1$ (see Problem 4, Section 2.9). Let $\bar{X}$ and $S^2$ be the (sample) mean and variance of the components of $\mathbf{X}$.

(i) Determine the distribution of $\bar{X}$.
(ii) Determine the distribution of $S^2$.
(iii) Prove that $\bar{X}$ and $S^2$ are independent.
(iv) Derive the distribution of $\sqrt{n}\,(\bar{X} - \mu)/S$.


2.11.6 Let $\mathbf{t}$ have the multivariate $t$-distribution $t[\nu; \boldsymbol{\xi}, \sigma^2 R]$. Show that the covariance matrix of $\mathbf{t}$ is $\Sigma(\mathbf{t}) = \frac{\nu}{\nu - 2}\sigma^2 R$, $\nu > 2$.


Section 2.12 
2.12.1 Derive the p.d.f. (2.12.2) of $F[\nu_1, \nu_2]$.


2.12.2 Apply formulae (2.2.2) and (2.12.3) to derive the relationship between the binomial c.d.f. and that of the $F$-distribution, namely

$$B(a \mid n, \theta) = P\left\{F[2n - 2a, 2a + 2] \le \frac{a+1}{n-a} \cdot \frac{1-\theta}{\theta}\right\}. \tag{2.15.1}$$

Notice that this relationship can be used to compute the c.d.f. of a central-$F$ distribution with both $\nu_1$ and $\nu_2$ even by means of the binomial distribution. For example, $P\{F[6, 8] \le 8/3\} = B(3 \mid 6, \frac{1}{3}) = .89986$.
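
The numerical value quoted in this example can be reproduced from the binomial side of the identity. The following is a minimal sketch in Python (standard library only); the helper name `binom_cdf` is an illustrative choice and not part of the text's notation.

```python
from fractions import Fraction
from math import comb


def binom_cdf(a, n, theta):
    """B(a | n, theta) = sum_{j=0}^{a} C(n, j) theta^j (1 - theta)^(n - j)."""
    return sum(comb(n, j) * theta**j * (1 - theta) ** (n - j) for j in range(a + 1))


# The example in the text: n = 6, a = 3, theta = 1/3, so that
# 2n - 2a = 6, 2a + 2 = 8 and (a + 1)/(n - a) * (1 - theta)/theta = 8/3.
value = binom_cdf(3, 6, Fraction(1, 3))
print(value, round(float(value), 5))
```

Exact rational arithmetic is used here only so that the comparison with the five-decimal value in the text is free of rounding in intermediate steps.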


2.12.3 Derive formula (2.12.10). 


2.12.4 Apply formula (2.12.15) to express the c.d.f. of $F[2m, 2k; \lambda]$ as a Poisson mixture of binomial distributions.


Section 2.13 


2.13.1 Find the expected value and the variance of the sample correlation $r$ when the parameter is $\rho$.


2.13.2 Show that when $\rho = 0$ then the distribution of the sample correlation $r$ is symmetric around zero.


2.13.3 Express the quantiles of the sample correlation $r$, when $\rho = 0$, in terms of those of $t[n - 2]$.


Section 2.14 


2.14.1 Show that the families of binomial, Poisson, negative-binomial, and gamma distributions are exponential type families. In each case, identify the canonical parameters and the natural parameter space.


2.14.2 Show that the family of bivariate normal distributions is a five-parameter exponential type. What are the canonical parameters and the canonical variables?


2.14.3 Let $X \sim N(\mu, \sigma_1^2)$ and $Y \sim N(\mu, \sigma_2^2)$ where $X$ and $Y$ are independent. Show that the joint distribution of $(X, Y)$ is a curved exponential family.


2.14.4 Consider $n$ independent random variables, where $X_i \sim P(\mu(\alpha^i - \alpha^{i-1}))$ (Poisson), $i = 1, \ldots, n$. Show that their joint p.d.f. belongs to a two-parameter exponential family. What are the canonical parameters and what are the canonical statistics? The parameter space is $\Theta = \{(\mu, \alpha) : 0 < \mu < \infty,\ 1 < \alpha < \infty\}$.


2.14.5 Let $T_1, T_2, \ldots, T_n$ be i.i.d. random variables having an exponential distribution $E(\lambda)$. Let $0 < t_0 < \infty$. We observed the censored variables $X_1, \ldots, X_n$ where

$$X_i = T_i I\{T_i < t_0\} + t_0 I\{T_i \ge t_0\},$$

$i = 1, \ldots, n$. Show that the joint distribution of $X_1, \ldots, X_n$ is a curved exponential family. What are the canonical statistics?


Section 2.15 


2.15.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a binomial distribution $B(1, \theta)$. Compare the c.d.f. of $\bar{X}_n$, for $n = 10$ and $\theta = .3$, with the corresponding Edgeworth approximation.


2.15.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a Weibull distribution $G^{1/\alpha}(\lambda, 1)$. Approximate the distribution of $\bar{X}_n$ by the Edgeworth approximation for $n = 20$, $\alpha = 2.5$, $\lambda = 1$.
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1 
2.15.3 Approximate the c.d.f. of the Weibull, G'/¢(, 1), when w = 2.5, 4 = — 


by an Edgeworth approximation with n = 1. Compare the exact c.d_f. to the 
approximation. 


2.15.4 Let $\bar{X}_n$ be the sample mean of $n$ i.i.d. random variables having a log-normal distribution, $LN(\mu, \sigma^2)$. Determine the saddlepoint approximation to the p.d.f. of $\bar{X}_n$.


2.2.2 Prove that $\sum_{j=a}^{n} b(j; n, \theta) = I_\theta(a, n - a + 1)$.

$$\begin{aligned}
I_\theta(a, n - a + 1) &= \frac{1}{B(a, n - a + 1)} \int_0^\theta u^{a-1}(1-u)^{n-a}\, du \\
&= \frac{n!}{(a-1)!\,(n-a)!} \int_0^\theta u^{a-1}(1-u)^{n-a}\, du \\
&= \binom{n}{a}\theta^a(1-\theta)^{n-a} + \frac{n!}{a!\,(n-a-1)!} \int_0^\theta u^{a}(1-u)^{n-a-1}\, du \\
&= b(a; n, \theta) + I_\theta(a + 1, n - a) \\
&= b(a; n, \theta) + b(a + 1; n, \theta) + I_\theta(a + 2, n - a - 1) \\
&= \cdots = \sum_{j=a}^{n} b(j; n, \theta).
\end{aligned}$$


2.2.6 $X \sim \mathrm{Pascal}(\theta, \nu)$, i.e.,

$$P\{X = j\} = \binom{j-1}{\nu-1}\theta^{\nu}(1-\theta)^{j-\nu}, \quad j \ge \nu.$$

Let $k = j - \nu$, $k \ge 0$. Then

$$P\{X = j\} = P\{X - \nu = k\} = \frac{\Gamma(k+\nu)}{\Gamma(\nu)\,k!}(1-\theta)^{k}\theta^{\nu}.$$

Let $\psi = 1 - \theta$. Then

$$P\{X - \nu = k\} = \frac{\Gamma(k+\nu)}{\Gamma(\nu)\,k!}\psi^{k}(1-\psi)^{\nu}, \quad k \ge 0.$$

Thus, $X - \nu \sim NB(\psi, \nu)$ or $X \sim \nu + NB(\psi, \nu)$. The median of $X$ is equal to $\nu\, +$ the median of $NB(\psi, \nu)$. Using the formula

$$NB(n \mid \psi, \nu) = I_{1-\psi}(\nu, n + 1),$$

median of $NB(\psi, \nu)$ = least $n \ge 0$ such that $I_\theta(\nu, n + 1) \ge 0.5$. Denote this median by $n_{.5}$. Then $X_{.5} = \nu + n_{.5}$.


2.3.2


$U \sim R(0, 1)$. $Y \sim \beta^{-1}(U \mid a, b)$. For $0 < y < 1$

$$P\{Y \le y\} = P\{\beta^{-1}(U \mid a, b) \le y\} = P\{U \le \beta(y \mid a, b)\} = \beta(y \mid a, b) = I_y(a, b).$$

That is, $\beta^{-1}(U; a, b) \sim \beta(a, b)$.


$\chi^2[\nu] \sim 2G\left(1, \frac{\nu}{2}\right)$.

$$G(\beta, k) \sim \frac{1}{\beta}G(1, k) \sim \frac{1}{2\beta}\chi^2[2k].$$

Hence, $G^{-1}(p; \beta, k) = \frac{1}{2\beta}\chi^2_p[2k]$.


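$X \sim LN(\mu, \sigma^2)$.

$$\begin{aligned}
\mu_1 &= E\{X\} = E\{e^{N(\mu, \sigma^2)}\} = e^{\mu + \sigma^2/2} \\
\mu_2 &= E\{X^2\} = E\{e^{2N(\mu, \sigma^2)}\} = e^{2\mu + 2\sigma^2} \\
\mu_3 &= E\{X^3\} = E\{e^{3N(\mu, \sigma^2)}\} = e^{3\mu + \frac{9}{2}\sigma^2} \\
\mu_4 &= E\{X^4\} = E\{e^{4N(\mu, \sigma^2)}\} = e^{4\mu + 8\sigma^2} \\
\mu_2^* &= V\{X\} = e^{2\mu + \sigma^2}(e^{\sigma^2} - 1) \\
\mu_3^* &= \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3 \\
&= e^{3\mu + \frac{9}{2}\sigma^2} - 3e^{3\mu + \frac{5}{2}\sigma^2} + 2e^{3\mu + \frac{3}{2}\sigma^2} \\
&= e^{3\mu + \frac{3}{2}\sigma^2}(e^{3\sigma^2} - 3e^{\sigma^2} + 2) \\
\mu_4^* &= \mu_4 - 4\mu_3\mu_1 + 6\mu_2\mu_1^2 - 3\mu_1^4 \\
&= e^{4\mu + 2\sigma^2}(e^{6\sigma^2} - 4e^{3\sigma^2} + 6e^{\sigma^2} - 3).
\end{aligned}$$

Coefficient of skewness $\beta_1 = \mu_3^*/(\mu_2^*)^{3/2}$:

$$\beta_1 = \frac{e^{3\sigma^2} - 3e^{\sigma^2} + 2}{(e^{\sigma^2} - 1)^{3/2}}.$$

Coefficient of kurtosis

$$\beta_2 = \frac{\mu_4^*}{(\mu_2^*)^2} = \frac{e^{6\sigma^2} - 4e^{3\sigma^2} + 6e^{\sigma^2} - 3}{(e^{\sigma^2} - 1)^2}.$$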
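
The closed forms for the log-normal skewness and kurtosis can be cross-checked against the central moments computed directly from the raw moments $\mu_k = e^{k\mu + k^2\sigma^2/2}$. The sketch below does this for one arbitrary parameter choice; the function names and the values $\mu = 0.3$, $\sigma^2 = 0.25$ are illustrative assumptions, not taken from the text.

```python
from math import exp


def lognormal_raw_moment(k, mu, sigma2):
    """E{X^k} for X ~ LN(mu, sigma2)."""
    return exp(k * mu + k * k * sigma2 / 2)


def skew_kurt_closed_form(sigma2):
    """Skewness and kurtosis coefficients; neither depends on mu."""
    w = exp(sigma2)
    skew = (w**3 - 3 * w + 2) / (w - 1) ** 1.5
    kurt = (w**6 - 4 * w**3 + 6 * w - 3) / (w - 1) ** 2
    return skew, kurt


def skew_kurt_from_moments(mu, sigma2):
    """Same coefficients via the central moments mu_2*, mu_3*, mu_4*."""
    m1, m2, m3, m4 = (lognormal_raw_moment(k, mu, sigma2) for k in (1, 2, 3, 4))
    c2 = m2 - m1**2
    c3 = m3 - 3 * m2 * m1 + 2 * m1**3
    c4 = m4 - 4 * m3 * m1 + 6 * m2 * m1**2 - 3 * m1**4
    return c3 / c2**1.5, c4 / c2**2


print(skew_kurt_closed_form(0.25))
print(skew_kurt_from_moments(0.3, 0.25))
```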


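2.4.1 $X$, $Y$ are independent r.v. $P\{Y > 0\} = 1$, $E\{|X|\} < \infty$.

$$E\left\{\frac{X}{Y}\right\} = E\{X\} \cdot E\left\{\frac{1}{Y}\right\}.$$

By Jensen's inequality,

$$E\left\{\frac{X}{Y}\right\} \begin{cases} \ge E\{X\}/E\{Y\}, & \text{if } E\{X\} \ge 0 \\ \le E\{X\}/E\{Y\}, & \text{if } E\{X\} \le 0. \end{cases}$$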


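2.4.4 $X$, $Y$ are i.i.d. like $N(0, 1)$. Find the distribution of $R = X/Y$.

$$\begin{aligned}
F_R(r) &= P\{X/Y \le r\} \\
&= P\{X \le rY, Y > 0\} + P\{X \ge rY, Y < 0\} \\
&= \int_0^\infty \phi(y)\Phi(ry)\, dy + \int_{-\infty}^0 \phi(y)(1 - \Phi(ry))\, dy \\
&= \int_0^\infty \phi(y)\Phi(ry)\, dy + \int_0^\infty \phi(y)(1 - \Phi(-ry))\, dy \\
&= 2\int_0^\infty \phi(y)\Phi(ry)\, dy \\
&= \frac{1}{2} + \frac{1}{\pi}\tan^{-1}(r).
\end{aligned}$$

This is the standard Cauchy distribution with p.d.f.

$$f(r) = \frac{1}{\pi} \cdot \frac{1}{1 + r^2}, \quad -\infty < r < \infty.$$

Moments of $R$ do not exist.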
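
The last two steps of this derivation can be checked numerically by evaluating $2\int_0^\infty \phi(y)\Phi(ry)\,dy$ with a quadrature rule and comparing it with the Cauchy c.d.f. The following is a sketch only: the helper names, the truncation of the integral at 12, and the use of Simpson's rule are our assumptions.

```python
from math import atan, erf, exp, pi, sqrt


def phi(y):
    """Standard normal p.d.f."""
    return exp(-y * y / 2) / sqrt(2 * pi)


def Phi(y):
    """Standard normal c.d.f."""
    return 0.5 * (1 + erf(y / sqrt(2)))


def simpson(f, a, b, steps=4000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def ratio_cdf_numeric(r, upper=12.0):
    """2 * integral_0^upper phi(y) Phi(r y) dy; the tail beyond 12 is negligible."""
    return 2 * simpson(lambda y: phi(y) * Phi(r * y), 0.0, upper)


def cauchy_cdf(r):
    return 0.5 + atan(r) / pi


for r in (-3.0, 0.0, 1.0):
    print(r, ratio_cdf_numeric(r), cauchy_cdf(r))
```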


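2.4.9 $X$, $Y$ i.i.d. $E(\lambda)$, $U = X + Y$, $W = X - Y$, $U \sim G(\lambda, 2)$, $f_U(u) = \lambda^2 u e^{-\lambda u}$.
(i) The joint p.d.f. of $(X, Y)$ is $f(x, y) = \lambda^2 e^{-\lambda(x+y)}$, $0 < x, y < \infty$. Make the transformations

$$\begin{aligned} u &= x + y, & x &= \tfrac{1}{2}(u + w), \\ w &= x - y, & y &= \tfrac{1}{2}(u - w). \end{aligned}$$

The Jacobian is

$$J = \left|\det\begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} \end{pmatrix}\right| = \frac{1}{2}.$$

The joint p.d.f. of $(U, W)$ is $g(u, w) = \frac{\lambda^2}{2}e^{-\lambda u}$, $0 < u < \infty$, $|w| < u$.
The conditional density of $W$ given $U$ is

$$g(w \mid u) = \frac{\lambda^2 e^{-\lambda u}}{2\lambda^2 u e^{-\lambda u}} = \frac{1}{2u} I\{-u < w < u\}.$$

That is, $W \mid U \sim R(-U, U)$.
(ii) The marginal density of $W$ is

$$f_W(w) = \frac{\lambda^2}{2}\int_{|w|}^{\infty} e^{-\lambda u}\, du = \frac{\lambda}{2}e^{-\lambda|w|}, \quad -\infty < w < \infty.$$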

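2.5.4 All moments exist. $\mu_i = E(X^i)$, $i = 1, \ldots, 4$.

(i)
$$\begin{aligned}
E\{\bar{X}^4\} &= \frac{1}{n^4} E\left\{\Big(\sum_i X_i\Big)^4\right\} \\
&= \frac{1}{n^4} E\Big\{\sum_i X_i^4 + 4\sum_{i \ne j} X_i^3 X_j + 3\sum_{i \ne j} X_i^2 X_j^2 + 6\sum_{i \ne j \ne k} X_i^2 X_j X_k + \sum_{i \ne j \ne k \ne l} X_i X_j X_k X_l\Big\} \\
&= \frac{1}{n^4}\big[n\mu_4 + 4n(n-1)\mu_3\mu_1 + 3n(n-1)\mu_2^2 + 6n(n-1)(n-2)\mu_2\mu_1^2 + n(n-1)(n-2)(n-3)\mu_1^4\big] \\
&= \mu_1^4 + \frac{1}{n}(6\mu_2\mu_1^2 - 6\mu_1^4) + \frac{1}{n^2}(4\mu_3\mu_1 + 3\mu_2^2 - 18\mu_2\mu_1^2 + 11\mu_1^4) \\
&\quad + \frac{1}{n^3}(\mu_4 - 4\mu_3\mu_1 - 3\mu_2^2 + 12\mu_2\mu_1^2 - 6\mu_1^4).
\end{aligned}$$

(ii)
$$\begin{aligned}
E\{\bar{X}^5\} &= \frac{1}{n^5} E\Big\{\sum_i X_i^5 + 5\sum_{i \ne j} X_i^4 X_j + 10\sum_{i \ne j} X_i^3 X_j^2 + 10\sum_{i \ne j \ne k} X_i^3 X_j X_k + 15\sum_{i \ne j \ne k} X_i^2 X_j^2 X_k \\
&\quad + 10\sum_{i \ne j \ne k \ne l} X_i^2 X_j X_k X_l + \sum_{i \ne j \ne k \ne l \ne m} X_i X_j X_k X_l X_m\Big\} \\
&= \frac{1}{n^5}\big[n\mu_5 + 5n(n-1)\mu_4\mu_1 + 10n(n-1)\mu_3\mu_2 + 10n(n-1)(n-2)\mu_3\mu_1^2 + 15n(n-1)(n-2)\mu_2^2\mu_1 \\
&\quad + 10n(n-1)(n-2)(n-3)\mu_2\mu_1^3 + n(n-1)(n-2)(n-3)(n-4)\mu_1^5\big] \\
&= \mu_1^5 + \frac{1}{n}(10\mu_2\mu_1^3 - 10\mu_1^5) + \frac{1}{n^2}(10\mu_3\mu_1^2 + 15\mu_2^2\mu_1 - 60\mu_2\mu_1^3 + 35\mu_1^5) \\
&\quad + \frac{1}{n^3}(5\mu_4\mu_1 + 10\mu_3\mu_2 - 30\mu_3\mu_1^2 - 45\mu_2^2\mu_1 + 110\mu_2\mu_1^3 - 50\mu_1^5) \\
&\quad + \frac{1}{n^4}(\mu_5 - 5\mu_4\mu_1 - 10\mu_3\mu_2 + 20\mu_3\mu_1^2 + 30\mu_2^2\mu_1 - 60\mu_2\mu_1^3 + 24\mu_1^5).
\end{aligned}$$

(iii) Assume $\mu_1 = 0$, so that the moments of $X$ coincide with the central moments $\mu_k^*$.
$$\begin{aligned}
E\{(\bar{X} - \mu)^6\} &= \frac{1}{n^6} E\Big\{\sum_i X_i^6 + 6\sum_{i \ne j} X_i^5 X_j + 15\sum_{i \ne j} X_i^4 X_j^2 + 15\sum_{i \ne j \ne k} X_i^4 X_j X_k + 10\sum_{i \ne j} X_i^3 X_j^3 \\
&\quad + 60\sum_{i \ne j \ne k} X_i^3 X_j^2 X_k + 20\sum_{i \ne j \ne k \ne l} X_i^3 X_j X_k X_l + 15\sum_{i \ne j \ne k} X_i^2 X_j^2 X_k^2 + 45\sum_{i \ne j \ne k \ne l} X_i^2 X_j^2 X_k X_l \\
&\quad + 15\sum_{i \ne j \ne k \ne l \ne m} X_i^2 X_j X_k X_l X_m + \sum_{i \ne j \ne k \ne l \ne m \ne n} X_i X_j X_k X_l X_m X_n\Big\} \\
&= \frac{1}{n^6}\big[n\mu_6^* + 15n(n-1)\mu_4^*\mu_2^* + 10n(n-1)(\mu_3^*)^2 + 15n(n-1)(n-2)(\mu_2^*)^3\big].
\end{aligned}$$

$$E\{(\bar{X} - \mu)^6\} = \frac{15}{n^3}(\mu_2^*)^3 + \frac{1}{n^4}\big(15\mu_4^*\mu_2^* + 10(\mu_3^*)^2 - 45(\mu_2^*)^3\big) + \frac{1}{n^5}\big(\mu_6^* - 15\mu_4^*\mu_2^* - 10(\mu_3^*)^2 + 30(\mu_2^*)^3\big).$$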
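
Expansions of this kind are easy to get wrong by a coefficient, so an exact check on a simple distribution is useful. The sketch below compares the $1/n$ expansion of $E\{\bar{X}^4\}$ from part (i) with a direct computation for Bernoulli variables, for which $\mu_k = p$ for every $k \ge 1$ and $\sum X_i$ is binomial. The function names and the choice of the Bernoulli case are our assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def mean4_formula(n, m1, m2, m3, m4):
    """E{Xbar^4} from the 1/n expansion in part (i), given raw moments m1..m4."""
    n = Fraction(n)
    return (
        m1**4
        + (6 * m2 * m1**2 - 6 * m1**4) / n
        + (4 * m3 * m1 + 3 * m2**2 - 18 * m2 * m1**2 + 11 * m1**4) / n**2
        + (m4 - 4 * m3 * m1 - 3 * m2**2 + 12 * m2 * m1**2 - 6 * m1**4) / n**3
    )


def mean4_bernoulli_exact(n, p):
    """E{(S/n)^4} with S ~ B(n, p), summed exactly over the binomial p.d.f."""
    return sum(
        comb(n, s) * p**s * (1 - p) ** (n - s) * Fraction(s, n) ** 4
        for s in range(n + 1)
    )


p = Fraction(3, 10)
for n in (1, 2, 5, 8):
    print(n, mean4_formula(n, p, p, p, p), mean4_bernoulli_exact(n, p))
```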


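2.6.3 $(X_1, X_2)$ has a bivariate $NB(\theta_1, \theta_2, \nu)$. The marginals are $X_1 \sim NB\left(\frac{\theta_1}{1 - \theta_2}, \nu\right)$, $X_2 \sim NB\left(\frac{\theta_2}{1 - \theta_1}, \nu\right)$.

(i)
$$\mathrm{cov}(X_1, X_2) = \frac{\nu\theta_1\theta_2}{(1 - \theta_1 - \theta_2)^2},$$

$$V\{X_1\} = \frac{\nu\theta_1(1 - \theta_2)}{(1 - \theta_1 - \theta_2)^2}, \qquad V\{X_2\} = \frac{\nu\theta_2(1 - \theta_1)}{(1 - \theta_1 - \theta_2)^2},$$

$$\rho_{X_1, X_2} = \sqrt{\frac{\theta_1\theta_2}{(1 - \theta_1)(1 - \theta_2)}}.$$

(ii) $X_1 \mid X_2 \sim NB(\theta_1, \nu + X_2)$.

$$E(X_1 \mid X_2) = (\nu + X_2)\frac{\theta_1}{1 - \theta_1}, \qquad V(X_1 \mid X_2) = (\nu + X_2)\frac{\theta_1}{(1 - \theta_1)^2}.$$

(iii)
$$E\{V\{X_1 \mid X_2\}\} = \frac{\theta_1}{(1 - \theta_1)^2} E\{\nu + X_2\}.$$

Since $X_2 \sim NB\left(\frac{\theta_2}{1 - \theta_1}, \nu\right)$,

$$\nu + E\{X_2\} = \nu\left(1 + \frac{\theta_2}{1 - \theta_1 - \theta_2}\right) = \nu\,\frac{1 - \theta_1}{1 - \theta_1 - \theta_2},$$

$$E\{V\{X_1 \mid X_2\}\} = \frac{\nu\theta_1}{(1 - \theta_1)(1 - \theta_1 - \theta_2)},$$

$$\frac{E\{V\{X_1 \mid X_2\}\}}{V\{X_1\}} = \frac{1 - \theta_1 - \theta_2}{(1 - \theta_1)(1 - \theta_2)},$$

$$D^2 = 1 - \frac{1 - \theta_1 - \theta_2}{(1 - \theta_1)(1 - \theta_2)} = \frac{\theta_1\theta_2}{(1 - \theta_1)(1 - \theta_2)}.$$

Notice that $\rho^2_{X_1, X_2} = D^2$.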


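2.7.4 (i) The conditional m.g.f. of $\mathbf{X}$ given $\mathbf{Y}$ is

$$M_{X|Y}(\mathbf{t}) = \exp\left(\mathbf{t}'A\mathbf{Y} + \frac{1}{2}\mathbf{t}'\Sigma\mathbf{t}\right).$$

Hence, the m.g.f. of $\mathbf{X}$ is

$$M_X(\mathbf{t}) = \exp\left(\frac{1}{2}\mathbf{t}'\Sigma\mathbf{t} + \mathbf{t}'A\boldsymbol{\eta} + \frac{1}{2}\mathbf{t}'ADA'\mathbf{t}\right).$$

Thus, $\mathbf{X} \sim N(A\boldsymbol{\eta}, \Sigma + ADA')$.

(ii) The joint distribution of $\begin{pmatrix} \mathbf{X} \\ \mathbf{Y} \end{pmatrix}$ is

$$\begin{pmatrix} \mathbf{X} \\ \mathbf{Y} \end{pmatrix} \sim N\left(\begin{pmatrix} A\boldsymbol{\eta} \\ \boldsymbol{\eta} \end{pmatrix}, \begin{pmatrix} \Sigma + ADA' & AD \\ DA' & D \end{pmatrix}\right).$$

It follows that the conditional distribution of $\mathbf{Y}$ given $\mathbf{X}$ is

$$\mathbf{Y} \mid \mathbf{X} \sim N\big(\boldsymbol{\eta} + DA'(\Sigma + ADA')^{-1}(\mathbf{X} - A\boldsymbol{\eta}),\ D - DA'(\Sigma + ADA')^{-1}AD\big).$$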


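2.7.5
$$\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N\left(\mathbf{0}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right), \qquad X_i = \begin{cases} 0, & \text{if } Z_i \le 0 \\ 1, & \text{if } Z_i > 0, \end{cases} \quad i = 1, 2.$$

$$E\{X_i\} = P\{X_i = 1\} = P\{Z_i > 0\} = \frac{1}{2}, \qquad V\{X_i\} = \frac{1}{4}, \quad i = 1, 2.$$

$$\mathrm{cov}(X_1, X_2) = E(X_1 X_2) - \frac{1}{4}.$$

Conditioning on $Z_1 = z$, for which $Z_2 \sim N(\rho z, 1 - \rho^2)$,

$$\begin{aligned}
E(X_1 X_2) &= P(Z_1 > 0, Z_2 > 0) \\
&= \int_0^\infty \phi(z)\Phi\left(\frac{\rho z}{\sqrt{1 - \rho^2}}\right) dz \\
&= \frac{1}{4} + \frac{1}{2\pi}\tan^{-1}\left(\frac{\rho}{\sqrt{1 - \rho^2}}\right) \\
&= \frac{1}{4} + \frac{1}{2\pi}\sin^{-1}(\rho).
\end{aligned}$$

Hence,

$$\mathrm{cov}(X_1, X_2) = \frac{1}{2\pi}\sin^{-1}(\rho).$$

The tetrachoric correlation is

$$\tau = \frac{\frac{1}{2\pi}\sin^{-1}(\rho)}{1/4} = \frac{2}{\pi}\sin^{-1}(\rho),$$

or $\rho = \sin(\tau\pi/2)$.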


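2.7.11 $R = (\rho_{ij})$, with $\rho_{ij} = 1$ if $i = j$ and $\rho_{ij} = \lambda_i\lambda_j$ if $i \ne j$, $i, j = 1, \ldots, m$.

Let $U_0, U_1, \ldots, U_m$ be i.i.d. $N(0, 1)$ random variables. Define $Z_j = \lambda_j U_0 + \sqrt{1 - \lambda_j^2}\, U_j$, $j = 1, \ldots, m$. Obviously, $E\{Z_j\} = 0$, $V\{Z_j\} = 1$, $j = 1, \ldots, m$, and $\mathrm{cov}(Z_i, Z_j) = \lambda_i\lambda_j$ if $i \ne j$. Finally, since $U_0, \ldots, U_m$ are independent

$$\begin{aligned}
P[Z_1 \le h_1, \ldots, Z_m \le h_m] &= P\Big[\lambda_1 U_0 + \sqrt{1 - \lambda_1^2}\, U_1 \le h_1, \ldots, \lambda_m U_0 + \sqrt{1 - \lambda_m^2}\, U_m \le h_m\Big] \\
&= \int_{-\infty}^{\infty} \phi(u) \prod_{j=1}^{m} \Phi\left(\frac{h_j - \lambda_j u}{\sqrt{1 - \lambda_j^2}}\right) du.
\end{aligned}$$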
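
The one-dimensional integral representation lends itself to direct numerical evaluation. The sketch below evaluates it by Simpson's rule and checks two special cases that appear in the problems of Section 2.7: the equicorrelated case $\rho = 1/2$, where the orthant probability at $\mathbf{0}$ is $1/(m+1)$, and the bivariate case, where it is $\frac{1}{4} + \frac{1}{2\pi}\sin^{-1}(\rho)$. The function names, the integration range $[-10, 10]$ and the quadrature rule are our assumptions.

```python
from math import asin, erf, exp, pi, sqrt


def phi(u):
    return exp(-u * u / 2) / sqrt(2 * pi)


def Phi(u):
    return 0.5 * (1 + erf(u / sqrt(2)))


def simpson(f, a, b, steps=4000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def one_factor_cdf(h, lam, lo=-10.0, hi=10.0):
    """P[Z_1 <= h_1, ..., Z_m <= h_m] when corr(Z_i, Z_j) = lam_i * lam_j, i != j."""

    def integrand(u):
        value = phi(u)
        for hj, lj in zip(h, lam):
            value *= Phi((hj - lj * u) / sqrt(1 - lj * lj))
        return value

    return simpson(integrand, lo, hi)


# Equicorrelated case rho = 1/2 corresponds to lam_j = sqrt(1/2) for all j.
for m in (1, 2, 3):
    print(m, one_factor_cdf([0.0] * m, [sqrt(0.5)] * m), 1 / (m + 1))
print(one_factor_cdf([0.0, 0.0], [sqrt(0.3)] * 2), 0.25 + asin(0.3) / (2 * pi))
```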


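2.10.1 $X_1, \ldots, X_n$ are i.i.d. $E(\lambda)$.
(i) $f_{X_{(1)}}(x) = n\lambda e^{-n\lambda x}$.
(ii) $f_{X_{(n)}}(x) = n\lambda e^{-\lambda x}(1 - e^{-\lambda x})^{n-1}$.
(iii) $f_{X_{(1)}, X_{(n)}}(x, y) = n(n-1)\lambda^2 e^{-\lambda(x+y)}(e^{-\lambda x} - e^{-\lambda y})^{n-2}$, $0 < x < y$.
(iv) As shown in Example 2.12,

$$E(X_{(1)}) = \frac{1}{n\lambda}, \qquad V(X_{(1)}) = \frac{1}{n^2\lambda^2},$$

$$E\{X_{(n)}\} = \frac{1}{\lambda}\sum_{i=1}^{n}\frac{1}{i}, \qquad V\{X_{(n)}\} = \frac{1}{\lambda^2}\sum_{i=1}^{n}\frac{1}{i^2},$$

$$\mathrm{cov}(X_{(1)}, X_{(n)}) = E\{X_{(1)}X_{(n)}\} - \frac{1}{n\lambda^2}\sum_{i=1}^{n}\frac{1}{i},$$

$$E\{X_{(1)}X_{(n)}\} = \frac{1}{n\lambda^2}\left(\frac{2}{n} + \sum_{i=1}^{n-1}\frac{1}{i}\right).$$

Thus,

$$\mathrm{cov}(X_{(1)}, X_{(n)}) = \frac{1}{n\lambda^2}\left(\frac{2}{n} - \frac{1}{n}\right) = \frac{1}{n^2\lambda^2},$$

$$\rho_{X_{(1)}, X_{(n)}} = \frac{1}{n\left(\sum_{i=1}^{n}\frac{1}{i^2}\right)^{1/2}}.$$

Also, as shown in Example 2.12, $X_{(1)}$ is independent of $X_{(n)} - X_{(1)}$. Thus,

$$\mathrm{cov}(X_{(1)}, X_{(n)}) = \mathrm{cov}(X_{(1)}, X_{(1)} + (X_{(n)} - X_{(1)})) = V\{X_{(1)}\} = \frac{1}{n^2\lambda^2}.$$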
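
The correlation formula can be cross-checked against the classical representation of exponential order statistics as sums of independent spacings, $X_{(k)} = \sum_{i \le k} E_i/(\lambda(n-i+1))$ with $E_i$ i.i.d. $E(1)$. That representation is an outside fact assumed here rather than derived above, and the function names are ours.

```python
from math import sqrt


def min_max_corr(n):
    """Closed form from part (iv): 1 / (n * sqrt(sum_{i=1}^n 1/i^2))."""
    return 1.0 / (n * sqrt(sum(1.0 / i**2 for i in range(1, n + 1))))


def min_max_corr_from_spacings(n, lam=1.0):
    """Same correlation from the variances of the independent spacings."""
    # Variance of the i-th normalized spacing E_i / (lam * (n - i + 1)).
    var_spacing = [1.0 / (lam * (n - i + 1)) ** 2 for i in range(1, n + 1)]
    var_min = var_spacing[0]      # X_(1) is the first spacing
    var_max = sum(var_spacing)    # X_(n) is the sum of all spacings
    cov = var_spacing[0]          # only the first spacing is shared
    return cov / sqrt(var_min * var_max)


for n in (2, 5, 10):
    print(n, min_max_corr(n), min_max_corr_from_spacings(n, lam=2.5))
```

The correlation does not depend on $\lambda$, which the second function makes visible by accepting an arbitrary rate.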
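2.10.3 We have to derive the p.d.f. of $R = X_{(3)} - X_{(1)}$, where $X_1, X_2, X_3$ are i.i.d. $N(\mu, \sigma^2)$. Write $R = \sigma(Z_{(3)} - Z_{(1)})$, where $Z_1, \ldots, Z_3$ are i.i.d. $N(0, 1)$. The joint density of $(Z_{(1)}, Z_{(3)})$ is

$$f_{(1),(3)}(z_1, z_3) = \frac{6}{2\pi} e^{-\frac{1}{2}(z_1^2 + z_3^2)}(\Phi(z_3) - \Phi(z_1)),$$

where $-\infty < z_1 < z_3 < \infty$.
Let $U = Z_{(3)} - Z_{(1)}$. The joint density of $(Z_{(1)}, U)$ is

$$g(z, u) = \frac{6}{2\pi}\exp\left(-\frac{1}{2}\big(z^2 + (z + u)^2\big)\right)(\Phi(z + u) - \Phi(z)),$$

$-\infty < z < \infty$, $0 < u < \infty$. Thus, the marginal density of $U$ is

$$\begin{aligned}
f_U(u) &= \frac{6}{2\pi} e^{-u^2/4} \int_{-\infty}^{\infty} e^{-(z + u/2)^2}(\Phi(z + u) - \Phi(z))\, dz \\
&= \frac{6}{\sqrt{\pi}} e^{-u^2/4}\left(\Phi\left(\frac{u}{\sqrt{6}}\right) - \frac{1}{2}\right), \quad 0 < u < \infty,
\end{aligned}$$

and the p.d.f. of $R$ is $\frac{1}{\sigma}f_U(r/\sigma)$.


2.11.3
$$E\{\Phi(X)\} = P\{U \le X\},$$

where $U \sim N(0, 1)$ and $X$ are independent. Hence, writing $X = |Z|$ with $Z \sim N(0, 1)$ independent of $U$,

$$E\{\Phi(X)\} = P\{U \le |Z|\} = \frac{3}{4}.$$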
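
Both of the last two solutions can be checked by one-dimensional quadrature. The sketch below (standard library only; helper names, truncation limits, and Simpson's rule are our assumptions) verifies that the closed-form range density $\frac{6}{\sqrt{\pi}}e^{-u^2/4}(\Phi(u/\sqrt{6}) - \frac{1}{2})$ integrates to one and agrees with the integral over the position of the minimum, and that $E\{\Phi(|Z|)\} = \int_0^\infty 2\phi(z)\Phi(z)\,dz = 3/4$.

```python
from math import erf, exp, pi, sqrt


def phi(z):
    return exp(-z * z / 2) / sqrt(2 * pi)


def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))


def simpson(f, a, b, steps=4000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def range_pdf_closed(u):
    """Closed-form density of U = Z_(3) - Z_(1) for three i.i.d. N(0, 1), u > 0."""
    return 6.0 / sqrt(pi) * exp(-u * u / 4.0) * (Phi(u / sqrt(6.0)) - 0.5)


def range_pdf_integral(u, lim=12.0):
    """Same density, integrating the joint density of (Z_(1), U) over z."""

    def g(z):
        return exp(-0.5 * (z * z + (z + u) ** 2)) * (Phi(z + u) - Phi(z))

    return 6.0 / (2.0 * pi) * simpson(g, -lim, lim)


total_mass = simpson(range_pdf_closed, 0.0, 40.0)
e_phi_x = simpson(lambda z: 2.0 * phi(z) * Phi(z), 0.0, 12.0)  # E{Phi(|Z|)}
print(total_mass, e_phi_x)
```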


CHAPTER 3 


Sufficient Statistics and the 
Information in Samples 


PART I: THEORY 
3.1 INTRODUCTION 


The problem of statistical inference is to draw conclusions from the observed sample 
on some characteristics of interest of the parent distribution of the random variables 
under consideration. For this purpose we formulate a model that presents our assump- 
tions about the family of distributions to which the parent distribution belongs. For 
example, in an inventory management problem one of the important variables is the 
number of units of a certain item demanded every period by the customer. This is a 
random variable with an unknown distribution. We may be ready to assume that the 
distribution of the demand variable is negative binomial $NB(\psi, \nu)$. The statistical
model specifies the possible range of the parameters, called the parameter space,
and the corresponding family of distributions $\mathcal{F}$. In this example of an inventory
system, the model may be


$$\mathcal{F} = \{NB(\psi, \nu);\ 0 < \psi < 1,\ 0 < \nu < \infty\}.$$


Such a model represents the case where the two parameters, $\psi$ and $\nu$, are unknown.
The parameter space here is $\Theta = \{(\psi, \nu);\ 0 < \psi < 1,\ 0 < \nu < \infty\}$. Given a sample
of $n$ independent and identically distributed (i.i.d.) random variables $X_1, \ldots, X_n$,
representing the weekly demand, the question is what can be said about the specific
values of $\psi$ and $\nu$ from the observed sample?

Every sample contains a certain amount of information on the parent distribution. 
Intuitively we understand that the larger the number of observations in the sample 
(on i.i.d. random variables) the more information it contains on the distribution


Examples and Problems in Mathematical Statistics, First Edition. Shelemyahu Zacks. 
© 2014 John Wiley & Sons, Inc. Published 2014 by John Wiley & Sons, Inc. 




under consideration. Later in this chapter we will discuss two specific information 
functions, which are used in statistical design of experiments and data analysis. We 
start with the investigation of the question whether the sample data can be condensed 
by computing first the values of certain statistics without losing information. If such
statistics exist they are called sufficient statistics. The term statistic will be used
to indicate a function of the (observable) random variables that does not involve
any function of the unknown parameters. The sample mean, sample variance, the
sample order statistics, etc., are examples of statistics. As will be shown, the notion
of sufficiency of statistics is strongly dependent on the model under consideration.
For example, in the previously mentioned inventory example, as will be established
later, if the value of the parameter $\nu$ is known, a sufficient statistic is the sample
mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. On the other hand, if $\nu$ is unknown, the sufficient statistic is
the order statistic $(X_{(1)}, \ldots, X_{(n)})$. When $\nu$ is unknown, the sample mean $\bar{X}$ by itself
does not contain all the information on $\psi$ and $\nu$. In the following section we provide
a definition of sufficiency relative to a specified model and give a few examples. 


3.2. DEFINITION AND CHARACTERIZATION 
OF SUFFICIENT STATISTICS 


3.2.1 Introductory Discussion 


Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random vector having a joint c.d.f. $F_\theta(\mathbf{x})$ belonging to a
family $\mathcal{F} = \{F_\theta(\mathbf{x});\ \theta \in \Theta\}$. Such a random vector may consist of $n$ i.i.d. variables
or of dependent random variables. Let $T(\mathbf{X}) = (T_1(\mathbf{X}), \ldots, T_r(\mathbf{X}))'$, $1 \le r \le n$,
be a statistic based on $\mathbf{X}$. $T$ could be real ($r = 1$) or vector valued ($r > 1$). The
transformations $T_j(\mathbf{X})$, $j = 1, \ldots, r$, are not necessarily one-to-one. Let $f(\mathbf{x}; \theta)$
denote the (joint) probability density function (p.d.f.) of $\mathbf{X}$. In our notation here $T_j(\mathbf{X})$
is a concise expression for $T_j(X_1, \ldots, X_n)$. Similarly, $F_\theta(\mathbf{x})$ and $f(\mathbf{x}; \theta)$ represent
the multivariate functions $F_\theta(x_1, \ldots, x_n)$ and $f(x_1, \ldots, x_n; \theta)$. As in the previous
chapter, we assume throughout the present chapter that all the distribution functions
belonging to the same family are either absolutely continuous, discrete, or mixtures
of the two types.


Definition of Sufficiency. Let $\mathcal{F}$ be a family of distribution functions and let
$\mathbf{X} = (X_1, \ldots, X_n)$ be a random vector having a distribution in $\mathcal{F}$. A statistic $T(\mathbf{X})$ is
called sufficient with respect to $\mathcal{F}$ if the conditional distribution of $\mathbf{X}$ given $T(\mathbf{X})$ is
the same for all the elements of $\mathcal{F}$.


Accordingly, if the joint p.d.f. of $\mathbf{X}$, $f(\mathbf{x}; \theta)$, depends on a parameter $\theta$ and $T(\mathbf{X})$
is a sufficient statistic with respect to $\mathcal{F}$, the conditional p.d.f. $h(\mathbf{x} \mid t)$ of $\mathbf{X}$ given
$\{T(\mathbf{X}) = t\}$ is independent of $\theta$. Since $f(\mathbf{x}; \theta) = h(\mathbf{x} \mid t)g(t; \theta)$, where $g(t; \theta)$ is the
p.d.f. of $T(\mathbf{X})$, all the information on $\theta$ in $\mathbf{x}$ is summarized in $T(\mathbf{x})$.

The process of checking whether a given statistic is sufficient for some family
following the above definition may often be very tedious. Generally the identification
of sufficient statistics is done by the application of the following theorem. This
celebrated theorem was given first by Fisher (1922) and Neyman (1935). We state the
theorem here in terms appropriate for families of absolutely continuous or discrete
distributions. For more general formulations see Section 3.2.2. For the purposes of
our presentation we require that the family of distributions $\mathcal{F}$ consists of


(i) absolutely continuous distributions; or
(ii) discrete distributions, having jumps on a set of points $\{\xi_1, \xi_2, \ldots\}$ independent
of $\theta$, i.e., $\sum_{i=1}^{\infty} p(\xi_i; \theta) = 1$ for all $\theta \in \Theta$; or
(iii) mixtures of distributions satisfying (i) or (ii).

Such families of distributions will be called regular (Bickel and Doksum, 1977, p. 61).


The families of discrete or absolutely continuous distributions discussed in Chapter
2 are all regular.


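Theorem 3.2.1 (The Neyman-Fisher Factorization Theorem). Let $\mathbf{X}$ be a random
vector having a distribution belonging to a regular family $\mathcal{F}$ and having a joint p.d.f.
$f(\mathbf{x}; \theta)$, $\theta \in \Theta$. A statistic $T(\mathbf{X})$ is sufficient for $\mathcal{F}$ if and only if

$$f(\mathbf{x}; \theta) = K(\mathbf{x})g(T(\mathbf{x}); \theta), \tag{3.2.1}$$

where $K(\mathbf{x}) \ge 0$ is independent of $\theta$ and $g(T(\mathbf{x}); \theta) \ge 0$ depends on $\mathbf{x}$ only through
$T(\mathbf{x})$.


Proof. We provide here a proof for the case of discrete distributions.

(i) Sufficiency:
We show that (3.2.1) implies that the conditional distribution of $\mathbf{X}$ given
$\{T(\mathbf{X}) = t\}$ is independent of $\theta$. The (marginal) p.d.f. of $T(\mathbf{X})$ is, according
to (3.2.1),

$$g^*(t; \theta) = \sum_{\{\mathbf{x}\}} I\{\mathbf{x}; T(\mathbf{x}) = t\} f(\mathbf{x}; \theta) = g(t; \theta) \sum_{\{\mathbf{x}\}} I\{\mathbf{x}; T(\mathbf{x}) = t\} K(\mathbf{x}). \tag{3.2.2}$$

The joint p.d.f. of $\mathbf{X}$ and $T(\mathbf{X})$ is

$$p(\mathbf{x}, t; \theta) = I\{\mathbf{x}; T(\mathbf{x}) = t\} K(\mathbf{x})g(t; \theta). \tag{3.2.3}$$

Hence, the conditional p.d.f. of $\mathbf{X}$, given $\{T(\mathbf{X}) = t\}$ at every point $t$ such that
$g^*(t; \theta) > 0$, is

$$\frac{p(\mathbf{x}, t; \theta)}{g^*(t; \theta)} = \frac{I\{\mathbf{x}; T(\mathbf{x}) = t\} K(\mathbf{x})}{\sum_{\{\mathbf{y}\}} I\{\mathbf{y}; T(\mathbf{y}) = t\} K(\mathbf{y})}. \tag{3.2.4}$$

This proves that $T(\mathbf{X})$ is sufficient for $\mathcal{F}$.

(ii) Necessity:
Suppose that $T(\mathbf{X})$ is sufficient for $\mathcal{F}$. Then, for every $t$ at which the
(marginal) p.d.f. of $T(\mathbf{X})$, $g^*(t; \theta)$, is positive we have

$$\frac{p(\mathbf{x}, t; \theta)}{g^*(t; \theta)} = I\{\mathbf{x}; T(\mathbf{x}) = t\} B(\mathbf{x}), \tag{3.2.5}$$

where $B(\mathbf{x}) \ge 0$ is independent of $\theta$. Moreover, $\sum_{\{\mathbf{y}\}} I\{\mathbf{y}; T(\mathbf{y}) = t\} B(\mathbf{y}) = 1$
since (3.2.5) is a conditional p.d.f. Thus, for every $\mathbf{x}$,

$$p(\mathbf{x}, t; \theta) = I\{\mathbf{x}; T(\mathbf{x}) = t\} B(\mathbf{x})g^*(t; \theta). \tag{3.2.6}$$

Finally, since for every $\mathbf{x}$,

$$p(\mathbf{x}, t; \theta) = I\{\mathbf{x}; T(\mathbf{x}) = t\} f(\mathbf{x}; \theta), \tag{3.2.7}$$

we obtain that

$$f(\mathbf{x}; \theta) = B(\mathbf{x})g^*(T(\mathbf{x}); \theta), \quad \text{for all } \mathbf{x}. \tag{3.2.8}$$

QED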
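
A small discrete illustration may help fix the idea behind the sufficiency part of the proof. For $n$ i.i.d. $B(1, \theta)$ variables the joint p.d.f. factors as $\theta^{t}(1-\theta)^{n-t}$ with $t = \sum x_i$ and $K(\mathbf{x}) = 1$, so by (3.2.4) the conditional distribution of $\mathbf{X}$ given $T = t$ should be uniform on the $\binom{n}{t}$ points with that sum, whatever $\theta$ is. The sketch below confirms this by brute-force enumeration; the Bernoulli example and the function name are our choices and are not part of the theorem's statement.

```python
from fractions import Fraction
from itertools import product


def conditional_given_sum(n, theta):
    """Return {t: {x: P(X = x | sum(X) = t)}} for n i.i.d. B(1, theta) variables."""
    # Joint p.d.f. f(x; theta) = theta^t (1 - theta)^(n - t), t = sum(x).
    joint = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
    }
    # Marginal p.d.f. of T, i.e. g*(t; theta) in (3.2.2).
    marginal = {}
    for x, p in joint.items():
        marginal[sum(x)] = marginal.get(sum(x), 0) + p
    # Conditional p.d.f. (3.2.4).
    cond = {}
    for x, p in joint.items():
        cond.setdefault(sum(x), {})[x] = p / marginal[sum(x)]
    return cond


c1 = conditional_given_sum(4, Fraction(1, 5))
c2 = conditional_given_sum(4, Fraction(7, 10))
print(c1 == c2, sorted(set(c1[2].values())))
```

The two dictionaries coincide although the values of $\theta$ differ, which is exactly the statement that $T(\mathbf{X}) = \sum X_i$ is sufficient for this family.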


3.2.2 Theoretical Formulation 


3.2.2.1 Distributions and Measures 

We generalize the definitions and proofs of this section by providing measure- 
theoretic formulation. Some of these concepts were discussed in Chapter 1. This 
material can be skipped by students who have not had real analysis. 


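Let $(\Omega, \mathcal{A}, P)$ be a probability space. A random variable $X$ is a finite real-valued measurable function on this probability space, i.e., $X : \Omega \to \mathbb{R}$. Let $\mathcal{X}$ be the sample space (range of $X$), i.e., $\mathcal{X} = X(\Omega)$. Let $\mathcal{B}$ be the Borel $\sigma$-field on $\mathcal{X}$, and consider the probability space $(\mathcal{X}, \mathcal{B}, P^X)$ where, for each $B \in \mathcal{B}$, $P^X\{B\} = P\{X^{-1}(B)\}$. Since $X$ is a random variable, $\mathcal{B}^X = \{A : A = X^{-1}(B), B \in \mathcal{B}\} \subset \mathcal{A}$.

The distribution function of $X$ is

$$F_X(x) = P^X\{(-\infty, x]\}, \quad -\infty < x < \infty. \tag{3.2.9}$$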
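Let $X_1, X_2, \ldots, X_n$ be $n$ random variables defined on the same probability space $(\Omega, \mathcal{A}, P)$. The joint distribution of $X = (X_1, \ldots, X_n)'$ is a real-valued function on $\mathbb{R}^n$ defined as

$$F(x_1, \ldots, x_n) = P\left\{\bigcap_{i=1}^{n} [X_i \le x_i]\right\}. \tag{3.2.10}$$

Consider the probability space $(\mathcal{X}^{(n)}, \mathcal{B}^{(n)}, P^{(n)})$ where $\mathcal{X}^{(n)} = \mathcal{X} \times \cdots \times \mathcal{X}$, $\mathcal{B}^{(n)} = \mathcal{B} \times \cdots \times \mathcal{B}$ (or the Borel $\sigma$-field generated by the intervals $(-\infty, x_1] \times \cdots \times (-\infty, x_n]$, $(x_1, \ldots, x_n) \in \mathbb{R}^n$), and for $B \in \mathcal{B}^{(n)}$

$$P^{(n)}\{B\} = \int_B dF(x_1, \ldots, x_n). \tag{3.2.11}$$

A function $h : \mathcal{X}^{(n)} \to \mathbb{R}$ is said to be $\mathcal{B}^{(n)}$-measurable if the sets $h^{-1}((-\infty, \zeta])$ are in $\mathcal{B}^{(n)}$ for all $-\infty < \zeta < \infty$. By the notation $h \in \mathcal{B}^{(n)}$ we mean that $h$ is $\mathcal{B}^{(n)}$-measurable.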

A random sample of size n is the realization of n i.i.d. random variables (see 
Chapter 2 for definition of independence). 

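To economize in notation, we will denote by bold $x$ the vector $(x_1, \ldots, x_n)$, and by $F(x)$ the joint distribution of $(X_1, \ldots, X_n)$. Thus, for all $B \in \mathcal{B}^{(n)}$,

$$P^{(n)}\{B\} = P\{(X_1, \ldots, X_n) \in B\} = \int_B dF(x). \tag{3.2.12}$$

This is a probability measure on $(\mathcal{X}^{(n)}, \mathcal{B}^{(n)})$ induced by $F(x)$. Generally, a $\sigma$-finite measure $\mu$ on $\mathcal{B}^{(n)}$ is a nonnegative real-valued set function, i.e., $\mu : \mathcal{B}^{(n)} \to [0, \infty]$, such that

(i) $\mu(\emptyset) = 0$;
(ii) if $\{B_n\}_{n=1}^{\infty}$ is a sequence of mutually disjoint sets in $\mathcal{B}^{(n)}$, i.e., $B_i \cap B_j = \emptyset$ for any $i \ne j$, then

$$\mu\left(\bigcup_{n=1}^{\infty} B_n\right) = \sum_{n=1}^{\infty} \mu(B_n);$$

(iii) there exists a partition of $\mathcal{X}^{(n)}$, $\{B_1, B_2, \ldots\}$, for which $\mu(B_i) < \infty$ for all $i = 1, 2, \ldots$.

The Lebesgue measure $\int_B dx$ is a $\sigma$-finite measure on $\mathcal{B}^{(n)}$.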
B 
If there is a countable set of marked points in $\mathbb{R}^n$, $S = \{x_1, x_2, x_3, \ldots\}$, the counting measure is

$$N(B; S) = \{\#\text{ of marked points in } B\} = |B \cap S|.$$

$N(B; S)$ is a $\sigma$-finite measure, and for any finite real-valued function $g(x)$

$$\int_B g(x)\, dN(x; S) = \sum_{x} I\{x \in S \cap B\}\, g(x).$$

Notice that if $B$ is such that $N(B; S) = 0$ then $\int_B g(x)\, dN(x; S) = 0$. Similarly, if $B$ is such that $\int_B dx = 0$ then, for any positive integrable function $g(x)$, $\int_B g(x)\, dx = 0$. Moreover, $\nu(B) = \int_B g(x)\, dx$ and $\lambda(B) = \int_B g(x)\, dN(x; S)$ are $\sigma$-finite measures on $(\mathcal{X}^{(n)}, \mathcal{B}^{(n)})$.

Let $\nu$ and $\mu$ be two $\sigma$-finite measures defined on $(\mathcal{X}^{(n)}, \mathcal{B}^{(n)})$. We say that $\nu$ is absolutely continuous with respect to $\mu$ if $\mu(B) = 0$ implies that $\nu(B) = 0$. We denote this relationship by $\nu \ll \mu$. If $\nu \ll \mu$ and $\mu \ll \nu$, we say that $\nu$ and $\mu$ are equivalent, $\nu \equiv \mu$. We will use the notation $F \ll \mu$ if the probability measure $P_F$ is absolutely continuous with respect to $\mu$. If $F \ll \mu$ there exists a nonnegative function $f(x)$, which is $\mathcal{B}^{(n)}$ measurable, satisfying

$$P_F(B) = \int_B dF(x) = \int_B f(x)\, d\mu(x). \tag{3.2.13}$$

$f(x)$ is called the (Radon-Nikodym) derivative of $F$ with respect to $\mu$, or the generalized density (p.d.f.) of $F(x)$. We write

$$dF(x) = f(x)\, d\mu(x) \tag{3.2.14}$$

or

$$f(x) = \frac{dF(x)}{d\mu(x)}. \tag{3.2.15}$$

As discussed earlier, a statistical model is represented by a family $\mathcal{F}$ of distribution functions $F_\theta$ on $\mathcal{X}^{(n)}$, $\theta \in \Theta$. The family $\mathcal{F}$ is dominated by a $\sigma$-finite measure $\mu$ if $F_\theta \ll \mu$ for each $\theta \in \Theta$.
We consider only models of dominated families. A theorem in measure theory states that if $\mathcal{F} \ll \mu$ then there exists a countable sequence $\{F_{\theta_n}\}_{n=1}^{\infty} \subset \mathcal{F}$ such that

$$F^*(x) = \sum_{n=1}^{\infty} \frac{1}{2^n} F_{\theta_n}(x) \tag{3.2.16}$$

induces a probability measure $P^*$, which dominates $\mathcal{F}$.

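A statistic $T(X)$ is a measurable function of the data $X$. More precisely, let $T : \mathcal{X}^{(n)} \to \mathcal{T}^{(k)}$, $k \ge 1$, and let $\mathcal{C}^{(k)}$ be the Borel $\sigma$-field of subsets of $\mathcal{T}^{(k)}$. The function $T(X)$ is a statistic if, for every $C \in \mathcal{C}^{(k)}$, $T^{-1}(C) \in \mathcal{B}^{(n)}$. Let $\mathcal{B}^T = \{B : B = T^{-1}(C) \text{ for } C \in \mathcal{C}^{(k)}\}$. The probability measure $P^T$ on $\mathcal{C}^{(k)}$, induced by $P^X$, is given by

$$P^T\{C\} = P^X\{T^{-1}(C)\}, \quad C \in \mathcal{C}^{(k)}. \tag{3.2.17}$$

Thus, the induced distribution function of $T$ is $F^T(t)$, where $t \in \mathbb{R}^k$ and

$$F^T(t) = \int_{T^{-1}((-\infty, t_1] \times \cdots \times (-\infty, t_k])} dF(x). \tag{3.2.18}$$

If $F \ll \mu$ then $F^T \ll \mu^T$, where $\mu^T(C) = \mu(T^{-1}(C))$ for all $C \in \mathcal{C}^{(k)}$. The generalized density (p.d.f.) of $T$ with respect to $\mu^T$ is $g^T(t)$ where

$$dF^T(t) = g^T(t)\, d\mu^T(t). \tag{3.2.19}$$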


If $h(x)$ is $\mathcal{B}^{(n)}$ measurable and $\int |h(x)|\, dF(x) < \infty$, then the conditional expectation of $h(X)$ given $\{T(X) = t\}$ is a $\mathcal{B}^T$ measurable function, $E_F\{h(X) \mid T(X) = t\}$, for which

$$\int_{T^{-1}(C)} h(x)\, dF(x) = \int_{T^{-1}(C)} E_F\{h(X) \mid T(x)\}\, dF(x) = \int_C E_F\{h(X) \mid T(X) = t\}\, dF^T(t) \tag{3.2.20}$$

for all $C \in \mathcal{C}^{(k)}$. In particular, if $C = \mathcal{T}^{(k)}$ we obtain the law of the iterated expectation; namely

$$E_F\{h(X)\} = E_F\{E_F\{h(X) \mid T(X)\}\}. \tag{3.2.21}$$

Notice that $E_F\{h(X) \mid T(X)\}$ assumes a constant value on the coset $A(t) = \{x : T(x) = t\} = T^{-1}(\{t\})$, $t \in \mathcal{T}^{(k)}$.
3.2.2.2 Sufficient Statistics

Consider a statistical model $(\mathcal{X}^{(n)}, \mathcal{B}^{(n)}, \mathcal{F})$ where $\mathcal{F}$ is a family of joint distributions of the random sample. A statistic $T : (\mathcal{X}^{(n)}, \mathcal{B}^{(n)}, \mathcal{F}) \to (\mathcal{T}^{(k)}, \mathcal{C}^{(k)}, \mathcal{F}^T)$ is called sufficient for $\mathcal{F}$ if, for all $B \in \mathcal{B}^{(n)}$, $P_F\{B \mid T(X) = t\} = p(B; t)$ for all $F \in \mathcal{F}$. That is, the conditional distribution of $X$ given $T(X)$ is the same for all $F$ in $\mathcal{F}$. Moreover, for a fixed $t$, $p(B; t)$ is $\mathcal{B}^{(n)}$ measurable and for a fixed $B$, $p(B; t)$ is $\mathcal{C}^{(k)}$ measurable.


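Theorem 3.2.2. Let $(\mathcal{X}^{(n)}, \mathcal{B}^{(n)}, \mathcal{F})$ be a statistical model and $\mathcal{F} \ll \mu$. Let $\{F_{\theta_n}\}_{n=1}^{\infty} \subset \mathcal{F}$ be such that $F^*(x) = \sum_{n=1}^{\infty} \frac{1}{2^n} F_{\theta_n}(x)$ and $\mathcal{F} \ll P^*$. Then $T(X)$ is sufficient for $\mathcal{F}$ if and only if for each $\theta \in \Theta$ there exists a $\mathcal{B}^T$ measurable function $g_\theta(T(x))$ such that, for each $B \in \mathcal{B}^{(n)}$,

$$P_\theta\{B\} = \int_B g_\theta(T(x))\, dF^*(x), \tag{3.2.22}$$

i.e.,

$$dF_\theta(x) = g_\theta(T(x))\, dF^*(x). \tag{3.2.23}$$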
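Proof. (i) Assume that $T(X)$ is sufficient for $\mathcal{F}$. Accordingly, for each $B \in \mathcal{B}^{(n)}$,

$$P_\theta\{B \mid T(X)\} = p(B, T(X))$$

for all $\theta \in \Theta$. Fix $B$ in $\mathcal{B}^{(n)}$ and let $C \in \mathcal{C}^{(k)}$. Then

$$P_\theta\{B \cap T^{-1}(C)\} = \int_{T^{-1}(C)} p(B, T(x))\, dF_\theta(x),$$

for each $\theta \in \Theta$. In particular,

$$p(B, T(X)) = E^*\{I\{X \in B\} \mid T(X)\}.$$

By the Radon-Nikodym Theorem, since $F_\theta \ll F^*$ for each $\theta$, there exists a $\mathcal{B}^T$ measurable function $g_\theta(T(X))$ so that, for every $C \in \mathcal{C}^{(k)}$,

$$P_\theta\{T^{-1}(C)\} = \int_{T^{-1}(C)} g_\theta(T(x))\, dF^*(x).$$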
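Now, for $B \in \mathcal{B}^{(n)}$ and $\theta \in \Theta$,

$$P_\theta\{B\} = \int_{\mathcal{X}^{(n)}} p(B, T(x))\, dF_\theta(x) = \int_{T^{-1}(\mathcal{T}^{(k)})} E^*\{I\{X \in B\} \mid T(x)\}\, g_\theta(T(x))\, dF^*(x) = \int_B g_\theta(T(x))\, dF^*(x).$$

Hence, $dF_\theta(x)/dF^*(x) = g_\theta(T(x))$, which is $\mathcal{B}^T$ measurable.
(ii) Assume that there exists a $\mathcal{B}^T$ measurable function $g_\theta(T(x))$ so that, for each $\theta \in \Theta$,

$$dF_\theta(x) = g_\theta(T(x))\, dF^*(x).$$

Let $A \in \mathcal{B}^{(n)}$ and define the $\sigma$-finite measure $d\nu_\theta(x) = I\{x \in A\}\, dF_\theta(x)$. Then $\nu_\theta \ll F^*$. Thus,

$$d\nu_\theta(x)/dF^*(x) = P_\theta\{A \mid T(x)\}\, g_\theta(T(x)) = P^*\{A \mid T(x)\}\, g_\theta(T(x)).$$

Thus, $P_\theta\{A \mid T(X)\} = P^*\{A \mid T(X)\}$ for all $\theta \in \Theta$. Therefore $T$ is a sufficient statistic. QED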


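Theorem 3.2.3 (Abstract Formulation of the Neyman-Fisher Factorization Theorem). Let $(\mathcal{X}^{(n)}, \mathcal{B}^{(n)}, \mathcal{F})$ be a statistical model with $\mathcal{F} \ll \mu$. Then $T(X)$ is sufficient for $\mathcal{F}$ if and only if

$$dF_\theta(x) = g_\theta(T(x))\, h(x)\, d\mu(x), \quad \theta \in \Theta, \tag{3.2.24}$$

where $h \ge 0$ and $h \in \mathcal{B}^{(n)}$, $g_\theta \in \mathcal{B}^T$.

Proof. Since $\mathcal{F} \ll \mu$, there exists $\{F_{\theta_n}\}_{n=1}^{\infty} \subset \mathcal{F}$ such that $F^*(x) = \sum_{n=1}^{\infty} \frac{1}{2^n} F_{\theta_n}(x)$ dominates $\mathcal{F}$. Hence, by the previous theorem, $T(X)$ is sufficient for $\mathcal{F}$ if and only if there exists a $\mathcal{B}^T$ measurable function $g_\theta(T(x))$ so that

$$dF_\theta(x) = g_\theta(T(x))\, dF^*(x).$$

Let $f_{\theta_n}(x) = dF_{\theta_n}(x)/d\mu(x)$ and set $h(x) = \sum_{n=1}^{\infty} \frac{1}{2^n} f_{\theta_n}(x)$. The function $h(x) \in \mathcal{B}^{(n)}$ and

$$dF_\theta(x) = g_\theta(T(x))\, h(x)\, d\mu(x).$$

QED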
3.3 LIKELIHOOD FUNCTIONS AND MINIMAL SUFFICIENT STATISTICS

Consider a vector $X = (X_1, \ldots, X_n)'$ of random variables having a joint c.d.f. $F_\theta(x)$ belonging to a family $\mathcal{F} = \{F_\theta(x); \theta \in \Theta\}$. It is assumed that $\mathcal{F}$ is a regular family of distributions, i.e., $\mathcal{F} \ll \mu$, and, for each $\theta \in \Theta$, there exists $f(x; \theta)$ such that

$$dF_\theta(x) = f(x; \theta)\, d\mu(x).$$

$f(x; \theta)$ is the joint p.d.f. of $X$ with respect to $\mu(x)$. We define over the parameter space $\Theta$ a class of functions $L(\theta; X)$ called likelihood functions. The likelihood function of $\theta$ associated with a vector of random variables $X$ is defined up to a positive factor of proportionality as

$$L(\theta; X) \propto f(X; \theta). \tag{3.3.1}$$


The factor of proportionality in (3.3.1) may depend on $X$ but not on $\theta$. Accordingly, we say that two likelihood functions $L_1(\theta; X)$ and $L_2(\theta; X)$ are equivalent, i.e., $L_1(\theta; X) \sim L_2(\theta; X)$, if $L_1(\theta; X) = A(X) L_2(\theta; X)$ where $A(X)$ is a positive function independent of $\theta$. For example, suppose that $X = (X_1, \ldots, X_n)'$ is a vector of i.i.d. random variables having a $N(\theta, 1)$ distribution, $-\infty < \theta < \infty$. The likelihood function of $\theta$ can be defined as

$$L_1(\theta; X) = \exp\left\{-\frac{1}{2}(X - \theta 1)'(X - \theta 1)\right\} = \exp\left\{-\frac{1}{2} Q\right\} \exp\left\{-\frac{n}{2}(\bar{X} - \theta)^2\right\}, \tag{3.3.2}$$

where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, $Q = \sum_{i=1}^{n} (X_i - \bar{X})^2$, and $1' = (1, \ldots, 1)$, or as

$$L_2(\theta; X) = \exp\left\{-\frac{n}{2}(T(X) - \theta)^2\right\}, \tag{3.3.3}$$

where $T(X) = \frac{1}{n}\sum_{i=1}^{n} X_i$. We see that for a given value of $X$, $L_1(\theta; X) \sim L_2(\theta; X)$.

All the equivalent versions of a likelihood function $L(\theta; X)$ belong to the same equivalence class. They all represent similar functions of $\theta$.
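The equivalence of two versions of a likelihood can be illustrated numerically. The following is a minimal sketch, assuming the $N(\theta, 1)$ model of the example above; the function names `L1` and `L2` and the data vector are hypothetical choices made for this illustration. `L1` is the kernel of the joint normal density of the whole sample and `L2` depends on the sample only through its mean.

```python
import math


def L1(theta, x):
    """Kernel of the joint N(theta, 1) density of the full sample."""
    return math.exp(-0.5 * sum((xi - theta) ** 2 for xi in x))


def L2(theta, x):
    """Version that depends on the sample only through the mean T(x)."""
    n = len(x)
    t = sum(x) / n
    return math.exp(-0.5 * n * (t - theta) ** 2)


x = [0.3, -1.2, 2.5, 0.9]  # illustrative data, not from the text
xbar = sum(x) / len(x)
Q = sum((xi - xbar) ** 2 for xi in x)

# The ratio L1/L2 should be the same for every theta (it equals exp(-Q/2)).
ratios = [L1(theta, x) / L2(theta, x) for theta in (-1.0, 0.0, 0.5, 2.0)]
expected = math.exp(-0.5 * Q)
```

The ratio plays the role of the factor $A(X)$: it changes with the data but not with $\theta$.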

If $S(X)$ is a statistic having a p.d.f. $g^S(s; \theta)$, $(\theta \in \Theta)$, then the likelihood function of $\theta$ given $S(X) = s$ is $L^S(\theta; s) \propto g^S(s; \theta)$. $L^S(\theta; s)$ may or may not have a shape similar to $L(\theta; X)$. From the Factorization Theorem we obtain that if $L(\theta; X) \sim L^S(\theta; S(X))$, for all $X$, then $S(X)$ is a sufficient statistic for $\mathcal{F}$. The information on $\theta$ given by $X$ can be reduced to $S(X)$ without changing the factor of the likelihood function that depends on $\theta$. This factor is called the kernel of the likelihood function. In terms of the above example, if $T(X) = \bar{X}$, since $\bar{X} \sim N\left(\theta, \frac{1}{n}\right)$, $L^T(\theta; t) = \exp\left\{-\frac{n}{2}(t - \theta)^2\right\}$. Thus, for all $x$ such that $T(x) = t$, $L^T(\theta; t) \sim L_1(\theta; x) \sim L_2(\theta; x)$. $\bar{X}$ is indeed a sufficient statistic. The likelihood function $L^T(\theta; T(x))$ associated with any sufficient statistic for $\mathcal{F}$ is equivalent to the likelihood function $L(\theta; x)$ associated with $X$. Thus, if $T(X)$ is a sufficient statistic, then the likelihood ratio

$$L^T(\theta; T(X)) / L(\theta; X)$$

is independent of $\theta$. A sufficient statistic $T(X)$ is called minimal if it is a function of any other sufficient statistic $S(X)$. The question is how to determine whether a sufficient statistic $T(X)$ is minimal sufficient.

Every statistic $S(X)$ induces a partition of the sample space $\mathcal{X}^{(n)}$ of the observable random vector $X$. Such a partition is a collection of disjoint sets whose union is $\mathcal{X}^{(n)}$. Each set in this partition is determined so that all its elements yield the same value of $S(X)$. Conversely, every partition of $\mathcal{X}^{(n)}$ corresponds to some function of $X$. Consider now the partition whose sets contain only $x$ points having equivalent likelihood functions. More specifically, let $x^0$ be a point in $\mathcal{X}^{(n)}$. A coset of $x^0$ in this partition is

$$C(x^0) = \{x; L(\theta; x) \sim L(\theta; x^0)\}. \tag{3.3.4}$$

The partition of $\mathcal{X}^{(n)}$ is obtained by varying $x^0$ over all the points of $\mathcal{X}^{(n)}$. We call this partition the equivalent-likelihood partition. For example, in the $N(\theta, 1)$ case, $-\infty < \theta < \infty$, each coset consists of vectors $x$ having the same mean $\bar{x} = \frac{1}{n} 1'x$. These means index the cosets of the equivalent-likelihood partition. The statistic $T(X)$ corresponding to the equivalent-likelihood partition is called the likelihood statistic. This statistic is an index of the likelihood function $L(\theta; x)$. We show now that the likelihood statistic $T(X)$ is a minimal sufficient statistic (m.s.s.).

Let $x^{(1)}$ and $x^{(2)}$ be two different points and let $T(x)$ be the likelihood statistic. Then, $T(x^{(1)}) = T(x^{(2)})$ if and only if $L(\theta; x^{(1)}) \sim L(\theta; x^{(2)})$. Accordingly, $L(\theta; X)$ is a function of $T(X)$, i.e., $f(X; \theta) = A(X)\, g^*(T(X); \theta)$. Hence, by the Factorization Theorem, $T(X)$ is a sufficient statistic. If $S(X)$ is any other sufficient statistic, then each coset of $S(X)$ is contained in a coset of $T(X)$. Indeed, if $x^{(1)}$ and $x^{(2)}$ are such that $S(x^{(1)}) = S(x^{(2)})$ and $f(x^{(i)}; \theta) > 0$ $(i = 1, 2)$, we obtain from the Factorization Theorem that $f(x^{(1)}; \theta) = k(x^{(1)})\, g(S(x^{(1)}); \theta) = k(x^{(1)})\, g(S(x^{(2)}); \theta) = k(x^{(1)}) f(x^{(2)}; \theta) / k(x^{(2)})$, where $k(x^{(2)}) > 0$. That is, $L(\theta; x^{(1)}) \sim L(\theta; x^{(2)})$ and hence $T(x^{(1)}) = T(x^{(2)})$. This proves that $T(X)$ is a function of $S(X)$ and therefore minimal sufficient.

The minimal sufficient statistic can be determined by determining the likelihood statistic or, equivalently, by determining the partition of $\mathcal{X}^{(n)}$ having the property that $f(x^{(1)}; \theta) / f(x^{(2)}; \theta)$ is independent of $\theta$ for every two points in the same coset.
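The ratio criterion for the equivalent-likelihood partition lends itself to a direct check. The sketch below is an illustration under an assumed model, i.i.d. Poisson($\lambda$) observations, which is not the example used in this section; the helper names `poisson_joint` and `ratio_is_constant` are hypothetical. It tests whether $f(x^{(1)}; \lambda)/f(x^{(2)}; \lambda)$ stays constant over a grid of $\lambda$ values.

```python
import math


def poisson_joint(x, lam):
    """Joint p.d.f. of an i.i.d. Poisson(lam) sample x."""
    return math.prod(math.exp(-lam) * lam ** xi / math.factorial(xi) for xi in x)


def ratio_is_constant(x1, x2, lams=(0.5, 1.0, 2.0, 5.0), tol=1e-9):
    """True if f(x1; lam) / f(x2; lam) takes the same value on the grid of lam values."""
    ratios = [poisson_joint(x1, lam) / poisson_joint(x2, lam) for lam in lams]
    return all(abs(r - ratios[0]) <= tol * abs(ratios[0]) for r in ratios)


# Two samples with the same sum (7) lie in the same coset.
same_coset = ratio_is_constant((1, 4, 2), (3, 3, 1))
# Samples with different sums (7 and 2) lie in different cosets.
different_coset = ratio_is_constant((1, 4, 2), (0, 1, 1))
```

In this assumed model the cosets are indexed by the sample sum, so the sum acts as the likelihood statistic. A finite grid can only suggest independence of the parameter; it does not prove it.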


3.4 SUFFICIENT STATISTICS AND EXPONENTIAL TYPE FAMILIES 


In Section 2.16 we discussed the $k$-parameter exponential type family of distributions. If $X_1, \ldots, X_n$ are i.i.d. random variables having a $k$-parameter exponential type distribution, then the joint p.d.f. of $X = (X_1, \ldots, X_n)$, in its canonical form, is

$$f(x_1, \ldots, x_n; \psi_1, \ldots, \psi_k) = \prod_{i=1}^{n} h(x_i) \cdot \exp\left\{\psi_1 \sum_{i=1}^{n} U_1(x_i) + \cdots + \psi_k \sum_{i=1}^{n} U_k(x_i) - nK(\psi_1, \ldots, \psi_k)\right\}. \tag{3.4.1}$$

It follows that $T(X) = \left(\sum_{i=1}^{n} U_1(X_i), \ldots, \sum_{i=1}^{n} U_k(X_i)\right)$ is a sufficient statistic. The statistic $T(X)$ is minimal sufficient if the parameters $\{\psi_1, \ldots, \psi_k\}$ are linearly independent. Otherwise, by reparametrization we can reduce the number of natural parameters and obtain an m.s.s. that is a function of $T(X)$.

Dynkin (1951) investigated the conditions under which the existence of an m.s.s., 
which is a nontrivial reduction of the sample data, implies that the family of dis- 
tributions, F, is of the exponential type. The following regularity conditions are 
called Dynkin’s Regularity Conditions. In Dynkin’s original paper, condition (iii) 
required only piecewise continuous differentiability. Brown (1964) showed that it is 
insufficient. We phrase (iii) as required by Brown. 


Dynkin’s Regularity Conditions 


(i) The family $\mathcal{F} = \{F_\theta(x); \theta \in \Theta\}$ is a regular parametric family. $\Theta$ is an open subset of the Euclidean space $\mathbb{R}^k$.

(ii) If $f(x; \theta)$ is the p.d.f. of $F_\theta(x)$, then $S = \{x; f(x; \theta) > 0\}$ is independent of $\theta$.

(iii) The p.d.f.s $f(x; \theta)$ are such that, for each $\theta \in \Theta$, $f(x; \theta)$ is a continuously differentiable function of $x$ over $\mathcal{X}$.

(iv) The p.d.f.s $f(x; \theta)$ are differentiable with respect to $\theta$ for each $x \in S$.


Theorem 3.4.1 (Dynkin's). If the family $\mathcal{F}$ is regular in the sense of Dynkin, and if for a sample of $n > k$ i.i.d. random variables $U_1(X), \ldots, U_k(X)$ are linearly independent sufficient statistics, then the p.d.f. of $X$ is

$$f(x; \theta) = h(x) \exp\left\{\sum_{i=1}^{k} \psi_i(\theta) U_i(x) + C(\theta)\right\},$$

where the functions $\psi_1(\theta), \ldots, \psi_k(\theta)$ are linearly independent.


For a proof of this theorem and further reading on the subject, see Dynkin (1951), 
Brown (1964), Denny (1967, 1969), Tan (1969), Schmetterer (1974, p. 215), and 
Zacks (1971, p. 60). The connection between sufficient statistics and the exponential 
family was further investigated by Borges and Pfanzagl (1965), and Pfanzagl (1972). 
A one dimensional version of the theorem is proven in Schervish (1995, p. 109). 


3.5 SUFFICIENCY AND COMPLETENESS 


A family of distribution functions $\mathcal{F} = \{F_\theta(x); \theta \in \Theta\}$ is called complete if, for any integrable function $h(X)$,

$$\int h(x)\, dF_\theta(x) = 0 \quad \text{for all } \theta \in \Theta \tag{3.5.1}$$

implies that $P_\theta[h(X) = 0] = 1$ for all $\theta \in \Theta$.

A statistic $T(X)$ is called a complete sufficient statistic if it is sufficient for a family $\mathcal{F}$, and if the family $\mathcal{F}^T$ of all the distributions of $T(X)$ corresponding to the distributions in $\mathcal{F}$ is complete.

Minimal sufficient statistics are not necessarily complete. To show it, consider the family of distributions of Example 3.6 with $\xi_1 = \xi_2 = \xi$. It is a four-parameter, exponential-type distribution and the m.s.s. is

$$T(X, Y) = \left(\sum_{i=1}^{n} X_i, \sum_{i=1}^{m} Y_i, \sum_{i=1}^{n} X_i^2, \sum_{i=1}^{m} Y_i^2\right).$$

The family $\mathcal{F}^T$ is incomplete since $E_\theta\left\{\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{m}\sum_{i=1}^{m} Y_i\right\} = 0$ for all $\theta = (\xi, \sigma_1, \sigma_2)$. But $P_\theta\left[\frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{m}\sum_{i=1}^{m} Y_i\right] = 0$, all $\theta$. The reason for this incompleteness is that when $\xi_1 = \xi_2$ the four natural parameters are not independent. Notice that in this case the parameter space $\Omega = \{\psi = (\psi_1, \psi_2, \psi_3, \psi_4); \psi_1 = \psi_2 \psi_3 / \psi_4\}$ is three-dimensional.


Theorem 3.5.1. If the parameter space $\Omega$ corresponding to a $k$-parameter exponential type family is $k$-dimensional, then the family of the minimal sufficient statistic is complete.


The proof of this theorem is based on the analyticity of integrals of the type 
(2.16.4). For details, see Schervish (1995, p. 108). 

From this theorem we immediately deduce that the following families are 
complete. 


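1. $B(N, \theta)$, $0 < \theta < 1$; $N$ fixed.
2. $P(\lambda)$, $0 < \lambda < \infty$.
3. $NB(\psi, \nu)$, $0 < \psi < 1$; $\nu$ fixed.
4. $G(\lambda, \nu)$, $0 < \lambda < \infty$, $0 < \nu < \infty$.
5. $\beta(p, q)$, $0 < p, q < \infty$.
6. $N(\mu, \sigma^2)$, $-\infty < \mu < \infty$, $0 < \sigma < \infty$.
7. $M(N, \theta)$, $\theta = (\theta_1, \ldots, \theta_n)$, $0 < \sum_{i=1}^{n} \theta_i < 1$; $N$ fixed.
8. $N(\mu, V)$, $\mu \in \mathbb{R}^{(n)}$; $V$ positive definite.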


We define now a weaker notion of boundedly complete families. These are families for which if $h(x)$ is a bounded function and $E_\theta\{h(X)\} = 0$, for all $\theta \in \Theta$, then $P_\theta\{h(X) = 0\} = 1$, for all $\theta \in \Theta$. For an example of a boundedly complete family that is incomplete, see Fraser (1957, p. 25).


Theorem 3.5.2 (Bahadur). If $T(X)$ is a boundedly complete sufficient statistic, then $T(X)$ is minimal.

Proof. Suppose that $S(X)$ is a sufficient statistic. If $S(X) = \psi(T(X))$ then, for any Borel set $B \in \mathcal{B}$,

$$E\{P\{B \mid S(X)\} \mid T(X)\} = P\{B \mid S(X)\} \quad \text{a.s.}$$

Define

$$h(T) = E\{P\{B \mid S(X)\} \mid T(X)\} - P\{B \mid T(X)\}.$$

By the law of iterated expectation, $E_\theta\{h(T)\} = 0$, for all $\theta \in \Theta$. But since $T(X)$ is boundedly complete,

$$E\{P\{B \mid S(X)\} \mid T(X)\} = P\{B \mid S(X)\} = P\{B \mid T(X)\} \quad \text{a.s.}$$

Hence, $T \in \mathcal{B}^S$, which means that $T$ is a function of $S$. Hence $T(X)$ is an m.s.s. QED


3.6 SUFFICIENCY AND ANCILLARITY 


A statistic $A(X)$ is called ancillary if its distribution does not depend on the particular parameter(s) specifying the distribution of $X$. For example, suppose that $X \sim N(\theta 1_n, I_n)$, $-\infty < \theta < \infty$. The statistic $U = (X_2 - X_1, \ldots, X_n - X_1)$ is distributed like $N(0_{n-1}, I_{n-1} + J_{n-1})$. Since the distribution of $U$ does not depend on $\theta$, $U$ is ancillary for the family $\mathcal{F} = \{N(\theta 1_n, I_n), -\infty < \theta < \infty\}$. If $S(X)$ is a sufficient statistic for a family $\mathcal{F}$, the inference on $\theta$ can be based on the likelihood based on $S$. If $p_S(s; \theta)$ is the p.d.f. of $S$, and if $A(X)$ is ancillary for $\mathcal{F}$, with p.d.f. $h(a)$, one could write

$$p_S(s; \theta) = p_\theta(s \mid a)\, h(a), \tag{3.6.1}$$

where $p_\theta(s \mid a)$ is the conditional p.d.f. of $S$ given $\{A = a\}$. One could claim that, for inferential objectives, one should consider the family of conditional p.d.f.s $\mathcal{F}^{S|A} = \{p_\theta(s \mid a), \theta \in \Theta\}$. However, the following theorem shows that if $S$ is a complete sufficient statistic, conditioning on $A(X)$ does not yield anything different, since $p_S(s; \theta) = p_\theta(s \mid a)$, with probability one for each $\theta \in \Theta$.


Theorem 3.6.1 (Basu's Theorem). Let $X = (X_1, \ldots, X_n)'$ be a vector of i.i.d. random variables with a common distribution belonging to $\mathcal{F} = \{F_\theta(x), \theta \in \Theta\}$. Let $T(X)$ be a boundedly complete sufficient statistic for $\mathcal{F}$. Furthermore, suppose that $A(X)$ is an ancillary statistic. Then $T(X)$ and $A(X)$ are independent.

Proof. Let $C \in \mathcal{B}^A$, where $\mathcal{B}^A$ is the Borel $\sigma$-subfield induced by $A(X)$. Since the distribution of $A(X)$ is independent of $\theta$, we can determine $P\{A(X) \in C\}$ without any information on $\theta$. Moreover, the conditional probability $P\{A(X) \in C \mid T(X)\}$ is independent of $\theta$ since $T(X)$ is a sufficient statistic. Hence, $P\{A(X) \in C \mid T(X)\} - P\{A(X) \in C\}$ is a statistic depending on $T(X)$. According to the law of the iterated expectation,

$$E_\theta\{P\{A(X) \in C \mid T(X)\} - P\{A(X) \in C\}\} = 0, \quad \text{all } \theta \in \Theta. \tag{3.6.2}$$

Finally, since $T(X)$ is boundedly complete,

$$P\{A(X) \in C \mid T(X)\} = P\{A(X) \in C\} \tag{3.6.3}$$

with probability one for each $\theta$. Thus, $A(X)$ and $T(X)$ are independent. QED
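A simulation can make the conclusion of Basu's Theorem tangible, although it cannot prove independence. The sketch below is an illustration under stated assumptions: samples of size $n = 5$ from $N(\theta, 1)$, with the sample mean as the complete sufficient statistic and $Q = \sum (X_i - \bar{X})^2$ as an ancillary statistic. The function names, the seed, and the number of replications are hypothetical choices. It estimates the correlation between the two statistics, which should be near zero if they are independent (zero correlation is necessary, not sufficient, for independence).

```python
import random
import statistics


def simulate(theta, n=5, reps=20000, seed=0):
    """Draw `reps` samples of size n from N(theta, 1); return sample means and Q values."""
    rng = random.Random(seed)
    means, spreads = [], []
    for _ in range(reps):
        x = [rng.gauss(theta, 1.0) for _ in range(n)]
        xbar = sum(x) / n
        means.append(xbar)                                   # sufficient statistic
        spreads.append(sum((xi - xbar) ** 2 for xi in x))    # ancillary statistic Q
    return means, spreads


def correlation(a, b):
    """Sample correlation coefficient of two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5


means, spreads = simulate(theta=3.0)
corr = correlation(means, spreads)
mean_spread = statistics.fmean(spreads)  # should be close to n - 1 = 4 for any theta
```

Rerunning with a different `theta` should leave the distribution of `spreads` essentially unchanged, which is the ancillarity of $Q$ in this model.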
From Basu's Theorem, we can deduce that only if the sufficient statistic $S(X)$ is incomplete for $\mathcal{F}$, then an inference on $\theta$, conditional on an ancillary statistic, can be meaningful. An example of such inference is given in Example 3.10.


3.7 INFORMATION FUNCTIONS AND SUFFICIENCY 


In this section, we discuss two types of information functions used in statistical anal- 
ysis: the Fisher information function and the Kullback—Leibler information function. 
These two information functions are somewhat related but designed to fulfill differ- 
ent roles. The Fisher information function is applied in various estimation problems, 
while the Kullback—Leibler information function has direct applications in the the- 
ory of testing hypotheses. Other types of information functions, based on the log 
likelihood function, are discussed by Basu (1975), Barndorff-Nielsen (1978). 


3.7.1 The Fisher Information 


We start with the Fisher information and consider parametric families of distribution functions with p.d.f.s $f(x; \theta)$, $\theta \in \Theta$, which depend only on one real parameter $\theta$. A generalization to vector valued parameters is provided later.

Definition 3.7.1. The Fisher information function for a family $\mathcal{F} = \{F(x; \theta); \theta \in \Theta\}$, where $dF(x; \theta) = f(x; \theta)\, d\mu(x)$, is

$$I(\theta) = E_\theta\left\{\left[\frac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right\}. \tag{3.7.1}$$

Notice that according to this definition, $\frac{\partial}{\partial\theta} \log f(x; \theta)$ should exist with probability one, under $F_\theta$, and its second moment should exist. The random variable $\frac{\partial}{\partial\theta} \log f(X; \theta)$ is called the score function. In Example 3.11 we show a few cases.
We develop now some properties of the Fisher information when the density functions in $\mathcal{F}$ satisfy the following set of regularity conditions.

(i) $\Theta$ is an open interval on the real line (could be the whole line);
(ii) $\frac{\partial}{\partial\theta} f(x; \theta)$ exists (finite) for every $x$ and every $\theta \in \Theta$.
(iii) For each $\theta$ in $\Theta$ there exists a $\delta > 0$ and a positive integrable function $G(x; \theta)$ such that, for all $\phi$ in $(\theta - \delta, \theta + \delta)$,

$$\left|\frac{f(x; \phi) - f(x; \theta)}{\phi - \theta}\right| \le G(x; \theta). \tag{3.7.2}$$

(iv) $0 < E_\theta\left\{\left[\frac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right\} < \infty$ for each $\theta \in \Theta$.


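One can show that under condition (iii) (using the Lebesgue Dominated Convergence Theorem)

$$\frac{\partial}{\partial\theta} \int f(x; \theta)\, d\mu(x) = \int \frac{\partial}{\partial\theta} f(x; \theta)\, d\mu(x) = E_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta)\right\} = 0, \tag{3.7.3}$$

for all $\theta \in \Theta$. Thus, under these regularity conditions,

$$I(\theta) = V_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta)\right\}.$$

This may not be true if conditions (3.7.2) do not hold. Example 3.11 illustrates such a case where $X \sim R(0, \theta)$. Indeed, if $X \sim R(0, \theta)$ then

$$\frac{\partial}{\partial\theta} \int_0^\theta \frac{dx}{\theta} = 0 \ne \int_0^\theta \frac{\partial}{\partial\theta} \frac{1}{\theta}\, dx = -\frac{1}{\theta}.$$

Moreover, in that example $V_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta)\right\} = 0$ for all $\theta$. Returning to cases where regularity conditions (3.7.2) are satisfied, we find that if $X_1, \ldots, X_n$ are i.i.d. and $I_n(\theta)$ is the Fisher information function based on their joint distribution,

$$I_n(\theta) = E_\theta\left\{\left[\frac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right\}. \tag{3.7.4}$$

Since $X_1, \ldots, X_n$ are i.i.d. random variables, then

$$\frac{\partial}{\partial\theta} \log f(X; \theta) = \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i; \theta) \tag{3.7.5}$$

and due to (3.7.3),

$$I_n(\theta) = nI(\theta). \tag{3.7.6}$$

Thus, under the regularity conditions (3.7.2), $I(\theta)$ is an additive function.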
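The zero-mean property of the score in (3.7.3) and the definition (3.7.1) can be evaluated directly for a regular discrete family. The sketch below is illustrative only: it assumes a Poisson($\lambda$) model, for which the score is $x/\lambda - 1$ and the Fisher information is $1/\lambda$. The function names and the truncation point `upper` of the infinite sum are hypothetical choices.

```python
import math


def poisson_pmf(x, lam):
    """Poisson p.d.f., computed on the log scale for numerical stability."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def score(x, lam):
    """Derivative of log f(x; lam) with respect to lam."""
    return x / lam - 1.0


def score_moments(lam, upper=200):
    """First and second moments of the score, summing the p.d.f. over 0..upper-1."""
    first = sum(score(x, lam) * poisson_pmf(x, lam) for x in range(upper))
    second = sum(score(x, lam) ** 2 * poisson_pmf(x, lam) for x in range(upper))
    return first, second


lam = 2.5
mean_score, info = score_moments(lam)  # expect about 0 and 1/lam
info_n = 10 * info                     # additivity (3.7.6) for n = 10 i.i.d. observations
```

The additivity step simply applies (3.7.6); it is not recomputed from the joint distribution here.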
We consider now the information available in a statistic $S = (S_1(X), \ldots, S_r(X))$, where $1 \le r \le n$. Let $g^S(y_1, \ldots, y_r; \theta)$ be the joint p.d.f. of $S$. The Fisher information function corresponding to $S$ is analogously

$$I^S(\theta) = E_\theta\left\{\left[\frac{\partial}{\partial\theta} \log g^S(Y_1, \ldots, Y_r; \theta)\right]^2\right\}. \tag{3.7.7}$$

We obviously assume that the family of induced distributions of $S$ satisfies the regularity conditions (i)-(iv). We show now that

$$I_n(\theta) \ge I^S(\theta), \quad \text{all } \theta \in \Theta. \tag{3.7.8}$$

We first show that

$$\frac{\partial}{\partial\theta} \log g^S(y; \theta) = E_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \,\Big|\, S = y\right\}, \quad \text{a.s.} \tag{3.7.9}$$

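We prove (3.7.9) first for the discrete case. The general proof follows. Let $A(y) = \{x; S_1(x) = y_1, \ldots, S_r(x) = y_r\}$. The joint p.d.f. of $S$ at $y$ is given by

$$g^S(y; \theta) = \sum_{x} I\{x; x \in A(y)\}\, f(x; \theta), \tag{3.7.10}$$

where $f(x; \theta)$ is the joint p.d.f. of $X$. Accordingly,

$$E_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \,\Big|\, Y = y\right\} = \sum_{x} I\{x; x \in A(y)\} \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right) \cdot f(x; \theta) / g^S(y; \theta). \tag{3.7.11}$$

Furthermore, for each $x$ such that $f(x; \theta) > 0$ and according to regularity condition (iii),

$$E_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \,\Big|\, Y = y\right\} = \frac{1}{g^S(y; \theta)} \sum_{x} I\{x; x \in A(y)\} \cdot \frac{\partial}{\partial\theta} f(x; \theta) = \frac{\partial}{\partial\theta} g^S(y; \theta) / g^S(y; \theta) = \frac{\partial}{\partial\theta} \log g^S(y; \theta). \tag{3.7.12}$$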
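To prove (3.7.9) generally, let $S : (\mathcal{X}, \mathcal{B}, \mathcal{F}) \to (\mathcal{S}, \Gamma, \mathcal{G})$ be a statistic and $\mathcal{F} \ll \mu$ and $\mathcal{F}$ be regular. Then, for any $C \in \Gamma$,

$$\begin{aligned}
\int_C E_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \,\Big|\, S = s\right\} g(s; \theta)\, d\lambda(s)
&= \int_{S^{-1}(C)} \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right) f(x; \theta)\, d\mu(x) \\
&= \int_{S^{-1}(C)} \frac{\partial}{\partial\theta} f(x; \theta)\, d\mu(x) = \frac{\partial}{\partial\theta} \int_{S^{-1}(C)} f(x; \theta)\, d\mu(x) \\
&= \frac{\partial}{\partial\theta} \int_C g(s; \theta)\, d\lambda(s) = \int_C \frac{\partial}{\partial\theta} g(s; \theta)\, d\lambda(s) \\
&= \int_C \left(\frac{\partial}{\partial\theta} \log g(s; \theta)\right) g(s; \theta)\, d\lambda(s).
\end{aligned} \tag{3.7.13}$$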
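Since $C$ is arbitrary, (3.7.9) is proven. Finally, to prove (3.7.8), write

$$\begin{aligned}
0 &\le E_\theta\left\{\left[\frac{\partial}{\partial\theta} \log f(X; \theta) - \frac{\partial}{\partial\theta} \log g^S(Y; \theta)\right]^2\right\} \\
&= I_n(\theta) + I^S(\theta) - 2E_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \cdot \frac{\partial}{\partial\theta} \log g^S(Y; \theta)\right\} \\
&= I_n(\theta) + I^S(\theta) - 2E_\theta\left\{\frac{\partial}{\partial\theta} \log g^S(Y; \theta) \cdot E_\theta\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \,\Big|\, Y\right\}\right\} = I_n(\theta) - I^S(\theta).
\end{aligned} \tag{3.7.14}$$

We prove now that if $T(X)$ is a sufficient statistic for $\mathcal{F}$, then

$$I^T(\theta) = I_n(\theta), \quad \text{all } \theta \in \Theta.$$

Indeed, from the Factorization Theorem, if $T(X)$ is sufficient for $\mathcal{F}$ then $f(x; \theta) = K(x)\, g(T(x); \theta)$, for all $\theta \in \Theta$. Accordingly, $I_n(\theta) = E_\theta\left\{\left[\frac{\partial}{\partial\theta} \log g(T(X); \theta)\right]^2\right\}$.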
On the other hand, the p.d.f. of $T(X)$ is $g^T(t; \theta) = A(t)\, g(t; \theta)$, all $\theta \in \Theta$. Hence, $\frac{\partial}{\partial\theta} \log g^T(t; \theta) = \frac{\partial}{\partial\theta} \log g(t; \theta)$ for all $\theta$ and all $t$. This implies that

$$I^T(\theta) = E_\theta\left\{\left[\frac{\partial}{\partial\theta} \log g^T(T(X); \theta)\right]^2\right\} = E_\theta\left\{\left[\frac{\partial}{\partial\theta} \log g(T(X); \theta)\right]^2\right\} = I_n(\theta), \tag{3.7.15}$$

for all $\theta \in \Theta$. Thus, we have proven that if a family of distributions, $\mathcal{F}$, admits a sufficient statistic, we can determine the amount of information in the sample from the distribution of the m.s.s.
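Both the inequality (3.7.8) and the equality for sufficient statistics can be seen in a small example. The sketch below is an illustration under assumptions not taken from the text: $n = 6$ i.i.d. Bernoulli($\theta$) observations, for which $I(\theta) = 1/(\theta(1-\theta))$. The helper `fisher_info` is a hypothetical function that approximates the score of a discrete statistic by a central finite difference with step `eps`. It compares the information in the sufficient statistic $T = \sum X_i$ (binomial) with that in the non-sufficient statistic $S = X_1$.

```python
import math


def binom_pmf(y, n, theta):
    """Binomial(n, theta) p.d.f. at y."""
    return math.comb(n, y) * theta ** y * (1 - theta) ** (n - y)


def fisher_info(pmf, support, theta, eps=1e-5):
    """E{[d/dtheta log g(Y; theta)]^2}, with the derivative taken by central differences."""
    total = 0.0
    for y in support:
        d_log = (math.log(pmf(y, theta + eps)) - math.log(pmf(y, theta - eps))) / (2 * eps)
        total += d_log ** 2 * pmf(y, theta)
    return total


n, theta = 6, 0.3
info_sample = n / (theta * (1 - theta))  # I_n(theta) = n I(theta) for the whole sample
# Information in the sufficient statistic T = sum of the observations.
info_sum = fisher_info(lambda y, th: binom_pmf(y, n, th), range(n + 1), theta)
# Information in the non-sufficient statistic S = X_1.
info_first = fisher_info(lambda y, th: binom_pmf(y, 1, th), range(2), theta)
```

In this setting the sum retains all of the information, while a single observation retains only a fraction $1/n$ of it.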

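Under regularity conditions (3.7.2), for any statistic $U(X)$,

$$I_n(\theta) = V\left\{\frac{\partial}{\partial\theta} \log f(X; \theta)\right\} = V\left\{E\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \,\Big|\, U(X)\right\}\right\} + E\left\{V\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \,\Big|\, U(X)\right\}\right\}.$$

By (3.7.15), if $U(X)$ is an ancillary statistic, $\log g^U(u; \theta)$ is independent of $\theta$. In this case $\frac{\partial}{\partial\theta} \log g^U(u; \theta) = E\left\{\frac{\partial}{\partial\theta} \log f(X; \theta) \,\Big|\, U\right\} = 0$, with probability 1, and

$$I_n(\theta) = E\{I(\theta \mid U)\}.$$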


3.7.2 The Kullback—Leibler Information 


The Kullback-Leibler (K-L) information function, to discriminate between two distributions $F_\theta(x)$ and $F_\phi(x)$ of $\mathcal{F} = \{F_\theta(x); \theta \in \Theta\}$, is defined as

$$I(\theta, \phi) = E_\theta\left\{\log \frac{f(X; \theta)}{f(X; \phi)}\right\}; \quad \theta, \phi \in \Theta. \tag{3.7.16}$$

The family $\mathcal{F}$ is assumed to be regular. We show now that $I(\theta, \phi) \ge 0$ with equality if and only if $f(X; \theta) = f(X; \phi)$ with probability one. To verify this, we recall that $\log x$ is a concave function of $x$ and by the Jensen inequality (see problem 8, Section 2.5), $\log(E\{Y\}) \ge E\{\log Y\}$ for every nonnegative random variable $Y$ having a finite expectation. Accordingly,

$$E_\theta\left\{\log \frac{f(X; \phi)}{f(X; \theta)}\right\} = \int \log \frac{f(x; \phi)}{f(x; \theta)}\, dF(x; \theta) \le \log \int dF(x; \phi) = 0. \tag{3.7.17}$$

Thus, multiplying both sides of (3.7.17) by $-1$, we obtain that $I(\theta, \phi) \ge 0$. Obviously, if $P_\theta\{f(X; \theta) = f(X; \phi)\} = 1$, then $I(\theta, \phi) = 0$. If $X_1, \ldots, X_n$ are i.i.d. random variables, then the information function in the whole sample is

$$I_n(\theta, \phi) = E_\theta\left\{\log \frac{f(X; \theta)}{f(X; \phi)}\right\} = \sum_{i=1}^{n} E_\theta\left\{\log \frac{f(X_i; \theta)}{f(X_i; \phi)}\right\} = nI(\theta, \phi). \tag{3.7.18}$$

This shows that the K-L information function is additive if the random variables are independent.
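The definition (3.7.16) can be evaluated by direct summation for a discrete family. The sketch below is an illustration under an assumed Poisson model, for which the K-L information has the closed form $\theta \log(\theta/\phi) - \theta + \phi$. The function names and the truncation point `upper` of the infinite sum are hypothetical choices.

```python
import math


def poisson_log_pmf(x, lam):
    """Logarithm of the Poisson(lam) p.d.f. at x."""
    return -lam + x * math.log(lam) - math.lgamma(x + 1)


def kl_information(theta, phi, upper=300):
    """I(theta, phi) = E_theta{log f(X; theta) / f(X; phi)}, summed over 0..upper-1."""
    total = 0.0
    for x in range(upper):
        log_f_theta = poisson_log_pmf(x, theta)
        total += math.exp(log_f_theta) * (log_f_theta - poisson_log_pmf(x, phi))
    return total


kl = kl_information(2.0, 3.5)
kl_same = kl_information(2.0, 2.0)                       # equal distributions give zero
closed_form = 2.0 * math.log(2.0 / 3.5) - 2.0 + 3.5      # known Poisson closed form
kl_n = 5 * kl                                            # additivity (3.7.18) for n = 5
```

Note that the function is not symmetric: `kl_information(3.5, 2.0)` generally differs from `kl_information(2.0, 3.5)`.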

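If $S(X) = (S_1(X), \ldots, S_r(X))$, $1 \le r \le n$, is a statistic having a p.d.f. $g^S(y_1, \ldots, y_r; \theta)$, then the K-L information function based on the information in $S(X)$ is

$$I^S(\theta, \phi) = E_\theta\left\{\log \frac{g^S(Y; \theta)}{g^S(Y; \phi)}\right\}. \tag{3.7.19}$$

We show now that

$$I^S(\theta, \phi) \le I_n(\theta, \phi), \tag{3.7.20}$$

for all $\theta, \phi \in \Theta$ and every statistic $S(X)$, with equality if $S(X)$ is a sufficient statistic. Since the logarithmic function is concave, we obtain from the Jensen inequality

$$-I_n(\theta, \phi) = E_\theta\left\{\log \frac{f(X; \phi)}{f(X; \theta)}\right\} \le E_\theta\left\{\log E_\theta\left\{\frac{f(X; \phi)}{f(X; \theta)} \,\Big|\, S(X)\right\}\right\}. \tag{3.7.21}$$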

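Generally, if $S$ is a statistic,

$$S : (\mathcal{X}, \mathcal{B}, \mathcal{F}) \to (\mathcal{S}, \Gamma, \mathcal{G}),$$

then for any $C \in \Gamma$

$$\begin{aligned}
\int_C E_\theta\left\{\frac{f(X; \phi)}{f(X; \theta)} \,\Big|\, S = s\right\} g^S(s; \theta)\, d\lambda(s)
&= \int_{S^{-1}(C)} \frac{f(x; \phi)}{f(x; \theta)} f(x; \theta)\, d\mu(x) = P_\phi\{S^{-1}(C)\} \\
&= \int_C g^S(s; \phi)\, d\lambda(s) = \int_C \frac{g^S(s; \phi)}{g^S(s; \theta)}\, g^S(s; \theta)\, d\lambda(s).
\end{aligned}$$

This proves that

$$\frac{g^S(s; \phi)}{g^S(s; \theta)} = E_\theta\left\{\frac{f(X; \phi)}{f(X; \theta)} \,\Big|\, S(X) = s\right\}. \tag{3.7.22}$$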


Substituting this expression for the conditional expectation in (3.7.21) and multiplying both sides of the inequality by $-1$, we obtain (3.7.20). To show that if $S(X)$ is sufficient then equality holds in (3.7.20), we apply the Factorization Theorem. Accordingly, if $S(X)$ is sufficient for $\mathcal{F}$,

$$\frac{f(x; \phi)}{f(x; \theta)} = \frac{K(x)\, g(S(x); \phi)}{K(x)\, g(S(x); \theta)} \tag{3.7.23}$$

at all points $x$ at which $K(x) > 0$. We recall that this set is independent of $\theta$ and has probability 1. Furthermore, the p.d.f. of $S(X)$ is

$$g^S(y; \theta) = A(y)\, g(y; \theta), \quad \text{all } \theta \in \Theta. \tag{3.7.24}$$

Therefore,

$$I_n(\theta, \phi) = E_\theta\left\{\log \frac{A(S(X))\, g(S(X); \theta)}{A(S(X))\, g(S(X); \phi)}\right\} = E_\theta\left\{\log \frac{g^S(Y; \theta)}{g^S(Y; \phi)}\right\} = I^S(\theta, \phi), \quad \text{for all } \theta, \phi \in \Theta. \tag{3.7.25}$$

3.8 THE FISHER INFORMATION MATRIX

We generalize here the notion of the information for cases where $f(x; \theta)$ depends on a vector of $k$ parameters. The score function, in the multiparameter case, is defined as the random vector

$$S(\theta; X) = \nabla_\theta \log f(X; \theta). \tag{3.8.1}$$

Under the regularity conditions (3.7.2), which are imposed on each component of $\theta$,

$$E_\theta\{S(\theta; X)\} = 0. \tag{3.8.2}$$

The covariance matrix of $S(\theta; X)$ is the Fisher Information Matrix (FIM)

$$I(\theta) = \Sigma_\theta[S(\theta; X)]. \tag{3.8.3}$$

If the components of (3.8.1) are not linearly dependent, then $I(\theta)$ is positive definite. In the $k$-parameter canonical exponential type family

$$\log f(X; \psi) = \log A^*(X) + \psi' U(X) - K(\psi). \tag{3.8.4}$$

The score vector is then

$$S(\psi; X) = U(X) - \nabla_\psi K(\psi), \tag{3.8.5}$$

and the FIM is

$$I(\psi) = \Sigma_\psi[U(X)] = \left(\frac{\partial^2}{\partial\psi_i\, \partial\psi_j} K(\psi);\ i, j = 1, \ldots, k\right). \tag{3.8.6}$$

Thus, in the canonical exponential type family, $I(\psi)$ is the Hessian matrix of the cumulant generating function $K(\psi)$.
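The identity between the FIM and the Hessian of $K(\psi)$ can be checked in the simplest case $k = 1$, where the Hessian reduces to a second derivative. The sketch below is illustrative and rests on an assumed model: the Poisson family in canonical form, with $U(x) = x$, $\psi = \log \lambda$, and $K(\psi) = e^\psi$. The function names, the finite-difference step `eps`, and the truncation point `upper` are hypothetical choices.

```python
import math


def K(psi):
    """Cumulant generating function of the canonical Poisson family (assumed model)."""
    return math.exp(psi)


def second_derivative(func, psi, eps=1e-4):
    """Central second-difference approximation of func''(psi)."""
    return (func(psi + eps) - 2 * func(psi) + func(psi - eps)) / eps ** 2


def variance_of_U(psi, upper=200):
    """Variance of U(X) = X under the canonical Poisson p.d.f., summed over 0..upper-1."""
    lam = math.exp(psi)
    pmf = [math.exp(-lam + x * psi - math.lgamma(x + 1)) for x in range(upper)]
    mean = sum(x * p for x, p in enumerate(pmf))
    return sum((x - mean) ** 2 * p for x, p in enumerate(pmf))


psi = math.log(3.0)
hessian = second_derivative(K, psi)  # approximately K''(psi) = exp(psi) = 3
var_u = variance_of_U(psi)           # variance of the sufficient statistic U(X)
```

For $k > 1$ the same comparison would be made entry by entry between the covariance matrix of $U(X)$ and the matrix of second partial derivatives of $K$.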

It is interesting to study the effect of reparametrization on the FIM. Suppose that the original parameter vector is $\theta$. We reparametrize by defining the $k$ functions

$$w_j = w_j(\theta_1, \ldots, \theta_k), \quad j = 1, \ldots, k.$$

Let

$$\theta_j = \psi_j(w_1, \ldots, w_k), \quad j = 1, \ldots, k,$$

and

$$D(w) = \left(\frac{\partial \psi_j(w)}{\partial w_i};\ i, j = 1, \ldots, k\right).$$

Then,

$$S(w; X) = D(w)\, \nabla_\psi \log f(x; \psi(w)).$$

It follows that the FIM, in terms of the parameters $w$, is

$$I(w) = \Sigma_w[S(w; X)] = D(w)\, I(\psi(w))\, D'(w). \tag{3.8.7}$$

Notice that $I(\psi(w))$ is obtained from $I(\theta)$ by substituting $\psi(w)$ for $\theta$.

Partition $\theta$ into subvectors $\theta^{(1)}, \ldots, \theta^{(l)}$ $(2 \le l \le k)$. We say that $\theta^{(1)}, \ldots, \theta^{(l)}$ are orthogonal subvectors if the FIM is block diagonal, with $l$ blocks, each containing only the parameters in the corresponding subvector.

In Example 3.14, $\mu$ and $\sigma^2$ are orthogonal parameters, while $\psi_1$ and $\psi_2$ are not orthogonal.
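The transformation rule (3.8.7) can be tried out in one dimension, where $D(w)$ is a scalar derivative. The sketch below is an illustration under assumptions not taken from the text: a single Bernoulli($\theta$) observation, reparametrized by the log-odds $w$, so that $\theta = \psi(w) = 1/(1 + e^{-w})$. The function names and the finite-difference step `eps` are hypothetical choices. The information in $w$ is computed twice, once through (3.8.7) and once directly from the score in $w$.

```python
import math


def theta_of_w(w):
    """Inverse reparametrization theta = psi(w): logistic function of the log-odds."""
    return 1.0 / (1.0 + math.exp(-w))


def info_theta(theta):
    """Fisher information of one Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))


def info_w_via_transformation(w, eps=1e-6):
    """I(w) = D(w) I(psi(w)) D(w), with D(w) = d theta / d w by central differences."""
    D = (theta_of_w(w + eps) - theta_of_w(w - eps)) / (2 * eps)
    return D * info_theta(theta_of_w(w)) * D


def info_w_direct(w):
    """Variance of the score in w, which equals x - theta for the Bernoulli log-odds."""
    theta = theta_of_w(w)
    return sum((x - theta) ** 2 * (theta if x == 1 else 1.0 - theta) for x in (0, 1))


w = 0.8
via_d = info_w_via_transformation(w)
direct = info_w_direct(w)
```

Both routes give $\theta(1 - \theta)$ evaluated at $\theta = \psi(w)$, up to the finite-difference error.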


3.9 SENSITIVITY TO CHANGES IN PARAMETERS 


3.9.1 The Hellinger Distance 


There are a variety of distance functions for probability functions. Following Pitman 
(1979), we apply here the Hellinger distance. 

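Let $\mathcal{F} = \{F(x; \theta), \theta \in \Theta\}$ be a family of distribution functions, dominated by a $\sigma$-finite measure $\mu$, i.e., $dF(x; \theta) = f(x; \theta)\, d\mu(x)$, for all $\theta \in \Theta$. Let $\theta_1$, $\theta_2$ be two points in $\Theta$. The Hellinger distance between $f(x; \theta_1)$ and $f(x; \theta_2)$ is

$$\rho(\theta_1, \theta_2) = \left(\int \left[\sqrt{f(x; \theta_1)} - \sqrt{f(x; \theta_2)}\right]^2 d\mu(x)\right)^{1/2}. \tag{3.9.1}$$

Obviously, $\rho(\theta_1, \theta_2) = 0$ if $\theta_1 = \theta_2$.
Notice that

$$\rho^2(\theta_1, \theta_2) = \int f(x; \theta_1)\, d\mu(x) + \int f(x; \theta_2)\, d\mu(x) - 2\int \sqrt{f(x; \theta_1) f(x; \theta_2)}\, d\mu(x). \tag{3.9.2}$$

Thus, $\rho(\theta_1, \theta_2) \le \sqrt{2}$, for all $\theta_1, \theta_2 \in \Theta$.

The sensitivity of $\rho(\theta, \theta_0)$ at $\theta_0$ is the derivative (if it exists) of $\rho(\theta, \theta_0)$, at $\theta = \theta_0$.

Notice that

$$\frac{\rho^2(\theta, \theta_0)}{(\theta - \theta_0)^2} = \int \frac{\left(\sqrt{f(x; \theta)} - \sqrt{f(x; \theta_0)}\right)^2}{(\theta - \theta_0)^2}\, d\mu(x). \tag{3.9.3}$$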
(0 — @) (0 — 0)? 
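If one can introduce the limit, as $\theta \to \theta_0$, under the integral at the r.h.s. of (3.9.3), then

$$\lim_{\theta \to \theta_0} \frac{\rho^2(\theta, \theta_0)}{(\theta - \theta_0)^2} = \int \left(\frac{\partial}{\partial\theta} \sqrt{f(x; \theta_0)}\right)^2 d\mu(x) = \int \frac{\left(\frac{\partial}{\partial\theta} f(x; \theta_0)\right)^2}{4 f(x; \theta_0)}\, d\mu(x) = \frac{1}{4} I(\theta_0). \tag{3.9.4}$$

Thus, if the regularity conditions (3.7.2) are satisfied, then

$$\lim_{\theta \to \theta_0} \frac{\rho(\theta, \theta_0)}{|\theta - \theta_0|} = \frac{1}{2}\sqrt{I(\theta_0)}. \tag{3.9.5}$$

Equation (3.9.5) expresses the sensitivity of $\rho(\theta, \theta_0)$, at $\theta_0$, as a function of the Fisher information $I(\theta_0)$.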
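The limit in (3.9.5) can be observed numerically for a regular family. The sketch below is an illustration under an assumed Poisson($\lambda$) model with counting measure, for which $I(\lambda_0) = 1/\lambda_0$. The function names, the grid of increments, and the truncation point `upper` are hypothetical choices. It evaluates (3.9.1) by direct summation and forms the ratio $\rho(\lambda_0 + h, \lambda_0)/h$ for decreasing $h$.

```python
import math


def poisson_pmf(x, lam):
    """Poisson p.d.f., computed on the log scale for numerical stability."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def hellinger(lam1, lam2, upper=200):
    """Hellinger distance (3.9.1) between two Poisson p.d.f.s, summed over 0..upper-1."""
    total = sum(
        (math.sqrt(poisson_pmf(x, lam1)) - math.sqrt(poisson_pmf(x, lam2))) ** 2
        for x in range(upper)
    )
    return math.sqrt(total)


lam0 = 4.0
fisher = 1.0 / lam0                   # Fisher information of Poisson(lam0)
limit = 0.5 * math.sqrt(fisher)       # right-hand side of (3.9.5)
ratios = [hellinger(lam0 + h, lam0) / h for h in (0.1, 0.01, 0.001)]
errors = [abs(r - limit) for r in ratios]
```

The ratios approach the limit as the increment shrinks, which is the behavior that fails for the rectangular family discussed next in the text.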

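Families of densities that do not satisfy the regularity conditions (3.7.2) usually will not satisfy (3.9.5). For example, consider the family of rectangular distributions $\mathcal{F} = \{R(0, \theta),\ 0 < \theta < \infty\}$.

For $\theta > \theta_0 > 0$,

\[ \rho(\theta, \theta_0) = \sqrt{2} \left( 1 - \left( \frac{\theta_0}{\theta} \right)^{1/2} \right)^{1/2} \]

and

\[ \frac{\partial}{\partial\theta} \rho(\theta, \theta_0) = \frac{1}{2\sqrt{2}} \left( 1 - \left( \frac{\theta_0}{\theta} \right)^{1/2} \right)^{-1/2} \frac{\theta_0^{1/2}}{\theta^{3/2}}. \]

Thus,

\[ \lim_{\theta \downarrow \theta_0} \frac{\partial}{\partial\theta} \rho(\theta, \theta_0) = \infty. \]

On the other hand, according to (3.7.1) with $n = 1$, $I(\theta_0) = \dfrac{1}{\theta_0^2}$.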
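The results of this section are generalizable to families depending on $k$ parameters $(\theta_1, \ldots, \theta_k)$. Under similar smoothness conditions, if $\lambda = (\lambda_1, \ldots, \lambda_k)'$ is such that $\lambda'\lambda = 1$, then

\[ \lim_{\nu \to 0} \int \left( \frac{\sqrt{f(x;\theta_0 + \lambda\nu)} - \sqrt{f(x;\theta_0)}}{\nu} \right)^2 d\mu(x) = \frac{1}{4} \lambda' I(\theta_0) \lambda, \tag{3.9.6} \]

where $I(\theta_0)$ is the FIM.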
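
The sensitivity relation (3.9.5) can be checked numerically on a regular family. The following is a minimal sketch, assuming the exponential family E(λ) with density λe^{-λx}, for which the Hellinger affinity ∫√(f₁f₂)dx = 2√(λ₁λ₂)/(λ₁+λ₂) and the Fisher information I(λ) = 1/λ². The function names are illustrative and are not taken from the text.

```python
import math

def hellinger_exponential(lam1: float, lam2: float) -> float:
    """Hellinger distance (3.9.1) between E(lam1) and E(lam2).

    Uses rho^2 = 2 * (1 - affinity), where the affinity
    int sqrt(f1 * f2) dx equals 2*sqrt(lam1*lam2)/(lam1 + lam2).
    """
    affinity = 2.0 * math.sqrt(lam1 * lam2) / (lam1 + lam2)
    return math.sqrt(max(0.0, 2.0 * (1.0 - affinity)))

def sensitivity_ratio(lam0: float, delta: float) -> float:
    """The ratio rho(lam0 + delta, lam0) / |delta| from (3.9.5)."""
    return hellinger_exponential(lam0 + delta, lam0) / abs(delta)

lam0 = 2.0
# Right-hand side of (3.9.5): (1/2) * sqrt(I(lam0)) with I(lam) = 1/lam^2.
target = 0.5 * math.sqrt(1.0 / lam0 ** 2)
for delta in (1e-1, 1e-2, 1e-3):
    print(delta, sensitivity_ratio(lam0, delta), target)
```

As delta shrinks, the printed ratio should approach the target value, in agreement with (3.9.5). For the rectangular family R(0, θ) the analogous ratio diverges instead.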
PART II: EXAMPLES


Example 3.1. Let $X_1, \ldots, X_n$ be i.i.d. random variables having an absolutely continuous distribution with a p.d.f. $f(x)$. Here we consider the family $\mathcal{F}$ of all absolutely continuous distributions. Let $T(X) = (X_{(1)}, \ldots, X_{(n)})$, where $X_{(1)} \le \cdots \le X_{(n)}$, be the order statistic. It is immediately shown that

\[ h(x \mid T(X) = t) = \frac{1}{n!}\, I\{x;\ x_{(1)} = t_1, \ldots, x_{(n)} = t_n\}. \]

Thus, the order statistic is a sufficient statistic. This result is obvious because the order in which the observations are obtained is irrelevant to the model. The order statistic is always a sufficient statistic when the random variables are i.i.d. On the other hand, as will be shown in the sequel, any statistic that further reduces the data is insufficient for $\mathcal{F}$ and causes some loss of information.


Example 3.2. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a Poisson distribution, $P(\lambda)$. The family under consideration is $\mathcal{F} = \{P(\lambda);\ 0 < \lambda < \infty\}$. Let $T(X) = \sum_{i=1}^n X_i$. We know that $T(X) \sim P(n\lambda)$. Furthermore, the joint p.d.f. of $X$ and $T(X)$ is

\[ p(x_1, \ldots, x_n, t; \lambda) = \frac{e^{-n\lambda} \lambda^{t}}{\prod_{i=1}^n x_i!}\, I\Big\{x : \sum_{i=1}^n x_i = t\Big\}. \]

Hence, the conditional p.d.f. of $X$ given $T(X) = t$ is

\[ h(x \mid t, \lambda) = \frac{t!}{\prod_{i=1}^n x_i!\; n^{t}}\, I\Big\{x : \sum_{i=1}^n x_i = t\Big\}, \]

where $x_1, \ldots, x_n$ are nonnegative integers and $t = 0, 1, \ldots$. We see that the conditional p.d.f. of $X$ given $T(X) = t$ is independent of $\lambda$. Hence $T(X)$ is a sufficient statistic. Notice that $X_1, \ldots, X_n$ have a conditional multinomial distribution given $\sum X_i = t$.
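
The claim of Example 3.2 can be illustrated numerically. The sketch below, with illustrative function names that are not from the text, computes the conditional probability of a Poisson sample given its sum for two different values of λ and compares it with the multinomial probability t!/(∏xᵢ! nᵗ).

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    return math.exp(-lam) * lam ** x / math.factorial(x)

def conditional_pmf(xs, lam: float) -> float:
    """P(X = xs | sum(X) = t) for i.i.d. Poisson(lam) variables.

    The sum T of n i.i.d. Poisson(lam) variables is Poisson(n * lam).
    """
    n, t = len(xs), sum(xs)
    joint = math.prod(poisson_pmf(x, lam) for x in xs)
    return joint / poisson_pmf(t, n * lam)

def multinomial_pmf(xs) -> float:
    """Multinomial(t; 1/n, ..., 1/n) probability of the cell counts xs."""
    n, t = len(xs), sum(xs)
    coef = math.factorial(t)
    for x in xs:
        coef //= math.factorial(x)  # exact: partial quotients are integers
    return coef / n ** t

sample = (2, 0, 3, 1)
for lam in (0.5, 4.0):
    print(lam, conditional_pmf(sample, lam), multinomial_pmf(sample))
```

Both values of λ should give the same conditional probability, up to rounding, which is what sufficiency of T(X) requires.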


Example 3.3. Let $X = (X_1, \ldots, X_n)'$ have a multinormal distribution $N(\mu 1_n, I_n)$, where $1_n = (1, 1, \ldots, 1)'$. Let $T = \sum_{i=1}^n X_i$. We set $X^* = (X_2, \ldots, X_n)'$ and derive the joint distribution of $(X^*, T)$. According to Section 2.9, $(X^*, T)$ has the multinormal distribution

\[ N\left( \mu \begin{pmatrix} 1_{n-1} \\ n \end{pmatrix},\ \Sigma \right), \]

where

\[ \Sigma = \begin{pmatrix} I_{n-1} & 1_{n-1} \\ 1_{n-1}' & n \end{pmatrix}. \]

Hence, the conditional distribution of $X^*$ given $T$ is the multinormal

\[ N(\bar{X}_n 1_{n-1},\ V_{n-1}), \]

where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ is the sample mean and $V_{n-1} = I_{n-1} - \frac{1}{n} J_{n-1}$, with $J_{n-1} = 1_{n-1} 1_{n-1}'$. It is easy to verify that $V_{n-1}$ is nonsingular. This conditional distribution is independent of $\mu$. Finally, the conditional p.d.f. of $X_1$ given $(X^*, T)$ is that of a one-point distribution

\[ h(x_1 \mid X^*, T; \mu) = I\{x : x = T - X^{*\prime} 1_{n-1}\}. \]

We notice that it is independent of $\mu$. Hence the p.d.f. of $X$ given $T$ is independent of $\mu$ and $T$ is a sufficient statistic.


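Example 3.4. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. random vectors having a bivariate normal distribution. The joint p.d.f. of the $n$ vectors is

\[ f(x, y; \xi, \eta, \sigma_1, \sigma_2, \rho) = \frac{1}{(2\pi)^n \sigma_1^n \sigma_2^n (1 - \rho^2)^{n/2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \sum_{i=1}^n \left[ \left( \frac{x_i - \xi}{\sigma_1} \right)^2 - 2\rho\, \frac{(x_i - \xi)(y_i - \eta)}{\sigma_1 \sigma_2} + \left( \frac{y_i - \eta}{\sigma_2} \right)^2 \right] \right\}, \]

where $-\infty < \xi, \eta < \infty$; $0 < \sigma_1, \sigma_2 < \infty$; $-1 \le \rho \le 1$. This joint p.d.f. can be written in the form

\[ f(x, y; \xi, \eta, \sigma_1, \sigma_2, \rho) = \frac{1}{(2\pi)^n \sigma_1^n \sigma_2^n (1 - \rho^2)^{n/2}} \exp\left\{ -\frac{n}{2(1 - \rho^2)} \left[ \frac{(\bar{x} - \xi)^2}{\sigma_1^2} - 2\rho\, \frac{(\bar{x} - \xi)(\bar{y} - \eta)}{\sigma_1 \sigma_2} + \frac{(\bar{y} - \eta)^2}{\sigma_2^2} \right] \right\} \]
\[ \qquad \cdot \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \frac{Q(x)}{\sigma_1^2} - 2\rho\, \frac{P(x, y)}{\sigma_1 \sigma_2} + \frac{Q(y)}{\sigma_2^2} \right] \right\}, \]

where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$, $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$, $Q(x) = \sum_{i=1}^n (x_i - \bar{x})^2$, $Q(y) = \sum_{i=1}^n (y_i - \bar{y})^2$, $P(x, y) = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$.

According to the Factorization Theorem, a sufficient statistic for $\mathcal{F}$ is

\[ T(X, Y) = (\bar{X}, \bar{Y}, Q(X), Q(Y), P(X, Y)). \]

It is interesting that even if $\sigma_1$ and $\sigma_2$ are known, the sufficient statistic is still $T(X, Y)$. On the other hand, if $\rho = 0$ then the sufficient statistic is $T^*(X, Y) = (\bar{X}, \bar{Y}, Q(X), Q(Y))$.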


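Example 3.5.
A. Binomial Distributions

$\mathcal{F} = \{B(n, \theta),\ 0 < \theta < 1\}$, $n$ is known. $X_1, \ldots, X_n$ is a sample of i.i.d. random variables. For every point $x^0$, at which $f(x^0; \theta) > 0$, we have

\[ \frac{f(x; \theta)}{f(x^0; \theta)} = \prod_{i=1}^n \frac{\binom{n}{x_i}}{\binom{n}{x_i^0}} \left( \frac{\theta}{1 - \theta} \right)^{\sum_{i=1}^n x_i - \sum_{i=1}^n x_i^0}. \]

Accordingly, this likelihood ratio can be independent of $\theta$ if and only if $\sum_{i=1}^n x_i = \sum_{i=1}^n x_i^0$. Thus, the m.s.s. is $T(X) = \sum_{i=1}^n X_i$.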
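
The likelihood-ratio criterion used in part A can be illustrated numerically. This is a small sketch with hypothetical sample points: the ratio is evaluated at several values of θ for a pair of points with equal sums and for a pair with unequal sums. The name `n_trials` stands for the known binomial parameter n.

```python
import math

def binom_joint_pmf(xs, n_trials: int, theta: float) -> float:
    """Joint p.d.f. of i.i.d. B(n_trials, theta) observations xs."""
    return math.prod(
        math.comb(n_trials, x) * theta ** x * (1 - theta) ** (n_trials - x)
        for x in xs
    )

def likelihood_ratio(xs, xs0, n_trials: int, theta: float) -> float:
    return binom_joint_pmf(xs, n_trials, theta) / binom_joint_pmf(xs0, n_trials, theta)

x = (1, 4, 2)
x0_same_sum = (3, 3, 1)    # same total as x
x0_other_sum = (3, 3, 2)   # different total

for theta in (0.2, 0.5, 0.7):
    print(theta,
          likelihood_ratio(x, x0_same_sum, 5, theta),
          likelihood_ratio(x, x0_other_sum, 5, theta))
```

The first ratio stays constant in θ, while the second changes with θ, consistent with the sum being minimal sufficient.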
B. Hypergeometric Distributions
$X_i \sim H(N, M, S)$, $i = 1, \ldots, n$. The joint p.d.f. of the sample is

\[ p(x; N, M, S) = \prod_{i=1}^n \frac{\binom{M}{x_i} \binom{N - M}{S - x_i}}{\binom{N}{S}}. \]

The unknown parameter here is $M$, $M = 0, \ldots, N$. $N$ and $S$ are fixed known values. The minimal sufficient statistic is the order statistic $T_n = (X_{(1)}, \ldots, X_{(n)})$. To see this, we consider the likelihood ratio

\[ \frac{p(x; N, M, S)}{p(x^0; N, M, S)} = \prod_{i=1}^n \frac{\binom{M}{x_i} \binom{N - M}{S - x_i}}{\binom{M}{x_i^0} \binom{N - M}{S - x_i^0}}. \]

This ratio is independent of $M$ if and only if $x_{(i)} = x^0_{(i)}$, for all $i = 1, 2, \ldots, n$.


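C. Negative-Binomial Distributions
$X_i \sim NB(\psi, \nu)$, $i = 1, \ldots, n$; $0 < \psi < 1$, $0 < \nu < \infty$.

(i) If $\nu$ is known, the joint p.d.f. of the sample is

\[ p(x; \psi, \nu) = (1 - \psi)^{n\nu}\, \psi^{\sum_{i=1}^n x_i} \prod_{i=1}^n \frac{\Gamma(x_i + \nu)}{\Gamma(\nu)\, \Gamma(x_i + 1)}. \]

Therefore, the m.s.s. is $T_n = \sum_{i=1}^n X_i$.

(ii) If $\nu$ is unknown, the p.d.f.s ratio is

\[ \psi^{\sum x_i - \sum x_i^0} \prod_{i=1}^n \frac{\Gamma(x_i^0 + 1)}{\Gamma(x_i + 1)} \cdot \frac{\Gamma(x_i + \nu)}{\Gamma(x_i^0 + \nu)}. \]

Hence, the minimal sufficient statistic is the order statistic.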
D. Multinomial Distributions

We have a sample of $n$ i.i.d. random vectors $X^{(i)} = (X_1^{(i)}, \ldots, X_k^{(i)})$, $i = 1, \ldots, n$. Each $X^{(i)}$ is distributed like the multinomial $M(s, \theta)$. The joint p.d.f. of the sample is

\[ p(x^{(1)}, \ldots, x^{(n)}; s, \theta) = \prod_{i=1}^n \frac{s!}{x_1^{(i)}! \cdots x_k^{(i)}!} \prod_{j=1}^k \theta_j^{x_j^{(i)}}. \]

Accordingly, an m.s.s. is $T_n = (T_n^{(1)}, \ldots, T_n^{(k-1)})$, where $T_n^{(j)} = \sum_{i=1}^n X_j^{(i)}$, $j = 1, \ldots, k - 1$. Notice that $T_n^{(k)} = ns - \sum_{j=1}^{k-1} T_n^{(j)}$.


E. Beta Distributions

$X_i \sim \beta(p, q)$, $i = 1, \ldots, n$; $0 < p, q < \infty$.

The joint p.d.f. of the sample is

\[ f(x; p, q) = \frac{1}{B^n(p, q)} \prod_{i=1}^n x_i^{p-1} (1 - x_i)^{q-1}, \]

$0 \le x_i \le 1$ for all $i = 1, \ldots, n$. Hence, an m.s.s. is $T_n = \left( \prod_{i=1}^n X_i,\ \prod_{i=1}^n (1 - X_i) \right)$. In cases where either $p$ or $q$ is known, the m.s.s. reduces to the component of $T_n$ that corresponds to the unknown parameter.


F. Gamma Distributions
$X_i \sim G(\lambda, \nu)$, $i = 1, \ldots, n$; $0 < \lambda, \nu < \infty$.

The joint distribution of the sample is

\[ f(x; \lambda, \nu) = \frac{\lambda^{n\nu}}{\Gamma^n(\nu)} \left( \prod_{i=1}^n x_i \right)^{\nu - 1} \exp\left\{ -\lambda \sum_{i=1}^n x_i \right\}. \]

Thus, if both $\lambda$ and $\nu$ are unknown, then an m.s.s. is $T_n = \left( \prod_{i=1}^n X_i,\ \sum_{i=1}^n X_i \right)$. If only $\nu$ is unknown, the m.s.s. is $T_n = \prod_{i=1}^n X_i$. If only $\lambda$ is unknown, the corresponding statistic $\sum_{i=1}^n X_i$ is minimal sufficient.


G. Weibull Distributions

$X$ has a Weibull distribution if $(X - \xi)^\alpha \sim E(\lambda)$. This is a three-parameter family, $\theta = (\xi, \lambda, \alpha)$; where $\xi$ is a location parameter (the density is zero for all $x < \xi$); $\lambda^{-1}$ is a scale parameter; and $\alpha$ is a shape parameter. We distinguish among three cases.

(i) $\xi$ and $\alpha$ are known.

Let $Y_i = X_i - \xi$, $i = 1, \ldots, n$. Since $Y_i^\alpha \sim E(\lambda)$, we immediately obtain that an m.s.s. is

\[ T_n = \sum_{i=1}^n Y_i^\alpha = \sum_{i=1}^n (X_i - \xi)^\alpha. \]

(ii) If $\alpha$ and $\lambda$ are known but $\xi$ is unknown, then a minimal sufficient statistic is the order statistic.

(iii) $\alpha$ is unknown.

The joint p.d.f. of the sample is

\[ f(x; \xi, \lambda, \alpha) = \lambda^n \alpha^n \prod_{i=1}^n (x_i - \xi)^{\alpha - 1} \exp\left\{ -\lambda \sum_{i=1}^n (x_i - \xi)^\alpha \right\}, \quad x_i \ge \xi \]

for all $i = 1, \ldots, n$. By examining this joint p.d.f., we realize that a minimal sufficient statistic is the order statistic, i.e., $T_n = (X_{(1)}, \ldots, X_{(n)})$.


H. Extreme Value Distributions
The joint p.d.f. of the sample is

\[ f(x; \lambda, \alpha) = \lambda^n \alpha^n \exp\left\{ -\alpha \sum_{i=1}^n x_i - \lambda \sum_{i=1}^n e^{-\alpha x_i} \right\}. \]

Hence, if $\alpha$ is known then $T_n = \sum_{i=1}^n e^{-\alpha X_i}$ is a minimal sufficient statistic; otherwise, a minimal sufficient statistic is the order statistic.
I. Normal Distributions
(i) Single (Univariate) Distribution Model

$X_i \sim N(\xi, \sigma^2)$, $i = 1, \ldots, n$; $-\infty < \xi < \infty$, $0 < \sigma < \infty$.

The m.s.s. is $T_n = \left( \sum_{i=1}^n X_i,\ \sum_{i=1}^n X_i^2 \right)$. If $\xi$ is known, then an m.s.s. is $\sum_{i=1}^n (X_i - \xi)^2$; if $\sigma$ is known, then the first component of $T_n$ is sufficient.

(ii) Two Distributions Model

We consider a two-sample model according to which $X_1, \ldots, X_n$ are i.i.d. having a $N(\xi, \sigma_1^2)$ distribution and $Y_1, \ldots, Y_m$ are i.i.d. having a $N(\eta, \sigma_2^2)$ distribution. The $X$-sample is independent of the $Y$-sample. In the general case, an m.s.s. is

\[ T = \left( \sum_{i=1}^n X_i,\ \sum_{j=1}^m Y_j,\ \sum_{i=1}^n X_i^2,\ \sum_{j=1}^m Y_j^2 \right). \]

If $\sigma_1^2 = \sigma_2^2$ then the m.s.s. reduces to $T^* = \left( \sum_{i=1}^n X_i,\ \sum_{j=1}^m Y_j,\ \sum_{i=1}^n X_i^2 + \sum_{j=1}^m Y_j^2 \right)$. On the other hand, if $\xi = \eta$ but $\sigma_1 \ne \sigma_2$ then the minimal sufficient statistic is $T$.


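Example 3.6. Let $(X, Y)$ have a bivariate distribution $N\left( \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \right)$ with $-\infty < \xi_1, \xi_2 < \infty$; $0 < \sigma_1, \sigma_2 < \infty$. The p.d.f. of $(X, Y)$ is

\[ f(x, y; \xi_1, \xi_2, \sigma_1, \sigma_2) = \frac{1}{2\pi \sigma_1 \sigma_2} \exp\left\{ \frac{\xi_1}{\sigma_1^2} x + \frac{\xi_2}{\sigma_2^2} y - \frac{1}{2\sigma_1^2} x^2 - \frac{1}{2\sigma_2^2} y^2 - \frac{1}{2} \left( \frac{\xi_1^2}{\sigma_1^2} + \frac{\xi_2^2}{\sigma_2^2} \right) \right\}. \]

This bivariate p.d.f. can be written in the canonical form

\[ f(x, y; \psi_1, \ldots, \psi_4) = \frac{1}{2\pi} \exp\{ \psi_1 x + \psi_2 y + \psi_3 x^2 + \psi_4 y^2 - K(\psi_1, \ldots, \psi_4) \}, \]

where

\[ \psi_1 = \frac{\xi_1}{\sigma_1^2}, \quad \psi_2 = \frac{\xi_2}{\sigma_2^2}, \quad \psi_3 = -\frac{1}{2\sigma_1^2}, \quad \psi_4 = -\frac{1}{2\sigma_2^2}, \]

and

\[ K(\psi_1, \ldots, \psi_4) = -\frac{1}{4} \left( \frac{\psi_1^2}{\psi_3} + \frac{\psi_2^2}{\psi_4} \right) + \frac{1}{2} \left[ \log\left( -\frac{1}{2\psi_3} \right) + \log\left( -\frac{1}{2\psi_4} \right) \right]. \]

Thus, if $\psi_1, \psi_2, \psi_3$, and $\psi_4$ are independent, then an m.s.s. is $T(X) = \left( \sum_{i=1}^n X_i,\ \sum_{i=1}^n Y_i,\ \sum_{i=1}^n X_i^2,\ \sum_{i=1}^n Y_i^2 \right)$. This is obviously the case when $\xi_1, \xi_2, \sigma_1, \sigma_2$ can assume arbitrary values. Notice that if $\xi_1 = \xi_2$ but $\sigma_1 \ne \sigma_2$ then $\psi_1, \ldots, \psi_4$ are still independent and $T(X)$ is an m.s.s. On the other hand, if $\xi_1 \ne \xi_2$ but $\sigma_1 = \sigma_2$ then an m.s.s. is

\[ T^*(X) = \left( \sum_{i=1}^n X_i,\ \sum_{i=1}^n Y_i,\ \sum_{i=1}^n (X_i^2 + Y_i^2) \right). \]

The case of $\xi_1 = \xi_2$, $\sigma_1 \ne \sigma_2$, is a case of a four-dimensional m.s.s., when the parameter space is three-dimensional. This is a case of a curved exponential family.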


Example 3.7. Binomial Distributions
$\mathcal{F} = \{B(n, \theta);\ 0 < \theta < 1\}$, $n$ fixed. Suppose that $E_\theta\{h(X)\} = 0$ for all $0 < \theta < 1$. This implies that

\[ \sum_{j=0}^n h(j) \binom{n}{j} \psi^j = 0, \quad \text{all } \psi, \]

$0 < \psi < \infty$, where $\psi = \theta/(1 - \theta)$ is the odds ratio. Let $a_{n,j} = h(j) \binom{n}{j}$, $j = 0, \ldots, n$. The expected value of $h(X)$ is a polynomial of order $n$ in $\psi$. According to the fundamental theorem of algebra, such a polynomial can have at most $n$ roots. However, the hypothesis is that the expected value is zero for all $\psi$ in $(0, \infty)$. Hence $a_{n,j} = 0$ for all $j = 0, \ldots, n$, independently of $\psi$. Or,

\[ P_\theta\{h(X) = 0\} = 1, \quad \text{all } \theta. \]
Example 3.8. Rectangular Distributions
Suppose that $\mathcal{F} = \{R(0, \theta);\ 0 < \theta < \infty\}$. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common distribution from $\mathcal{F}$. Let $X_{(n)}$ be the sample maximum. We show that the family of distributions of $X_{(n)}$, $\mathcal{F}_n^*$, is complete. The p.d.f. of $X_{(n)}$ is

\[ f_n(t; \theta) = \frac{n}{\theta^n} t^{n-1}, \quad 0 \le t \le \theta. \]

Suppose that $E_\theta\{h(X_{(n)})\} = 0$ for all $0 < \theta < \infty$. That is

\[ \int_0^\theta h(t)\, t^{n-1}\, dt = 0, \quad \text{for all } \theta,\ 0 < \theta < \infty. \]

Consider this integral as a Lebesgue integral. Differentiating with respect to $\theta$ yields

\[ h(x)\, x^{n-1} = 0, \quad \text{a.s. } [P_\theta], \]

$\theta \in (0, \infty)$.


Example 3.9. In Example 2.15, we considered the Model II of analysis of variance. The complete sufficient statistic for that model is

\[ T(X) = (\bar{X}, S_w^2, S_b^2), \]

where $\bar{X}$ is the grand mean; $S_w^2 = \dfrac{1}{k(n - 1)} \sum_{i=1}^k \sum_{j=1}^n (X_{ij} - \bar{X}_i)^2$ is the "within" sample variance; and $S_b^2 = \dfrac{n}{k - 1} \sum_{i=1}^k (\bar{X}_i - \bar{X})^2$ is the "between" sample variance. Employing Basu's Theorem we can immediately conclude that $\bar{X}$ is independent of $(S_w^2, S_b^2)$. Indeed, if we consider the subfamily $\mathcal{F}_{\sigma,\rho}$ for a fixed $\sigma$ and $\rho$, then $\bar{X}$ is a complete sufficient statistic. The distributions of $S_w^2$ and $S_b^2$, however, do not depend on $\mu$. Hence, they are independent of $\bar{X}$. Since this holds for any $\sigma$ and $\rho$, we obtain the result.


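Example 3.10. This example follows Example 2.23 of Barndorff-Nielsen and Cox (1994, p. 42). Consider the random vector $N = (N_1, N_2, N_3, N_4)$ having a multinomial distribution $M(n, p)$ where $p = (p_1, \ldots, p_4)'$ and, for $0 < \theta < 1$,

\[ p_1 = \frac{1}{6}(1 - \theta), \quad p_2 = \frac{1}{6}(1 + \theta), \quad p_3 = \frac{1}{6}(2 - \theta), \quad p_4 = \frac{1}{6}(2 + \theta). \]

The distribution of $N$ is a curved exponential type. $N$ is an m.s.s., but $N$ is incomplete. Indeed, $E_\theta\left\{ N_1 + N_2 - \frac{n}{3} \right\} = 0$ for all $0 < \theta < 1$, but $P_\theta\left\{ N_1 + N_2 = \frac{n}{3} \right\} < 1$ for all $\theta$. Consider the statistic $A_1 = N_1 + N_2$. $A_1 \sim B\left( n, \frac{1}{3} \right)$. Thus, $A_1$ is ancillary. The conditional p.d.f. of $N$, given $A_1 = a$, is

\[ p(n \mid a, \theta) = \binom{a}{n_1} \left( \frac{1 - \theta}{2} \right)^{n_1} \left( \frac{1 + \theta}{2} \right)^{a - n_1} \cdot \binom{n - a}{n_3} \left( \frac{2 - \theta}{4} \right)^{n_3} \left( \frac{2 + \theta}{4} \right)^{n - a - n_3}, \]

for $n_1 = 0, 1, \ldots, a$; $n_3 = 0, 1, \ldots, n - a$; $n_2 = a - n_1$ and $n_4 = n - a - n_3$. Thus, $N_1$ is conditionally independent of $N_3$ given $A_1 = a$.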


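Example 3.11. A. Let $X \sim B(n, \theta)$, $n$ known, $0 < \theta < 1$; $f(x; \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}$ satisfies the regularity conditions (3.7.2). Furthermore,

\[ \frac{\partial}{\partial\theta} \log f(x; \theta) = \frac{x}{\theta} - \frac{n - x}{1 - \theta}, \quad 0 < \theta < 1. \]

Hence, the Fisher information function is

\[ I(\theta) = E_\theta\left\{ \left[ \frac{X}{\theta} - \frac{n - X}{1 - \theta} \right]^2 \right\} = \frac{n}{\theta(1 - \theta)}, \quad 0 < \theta < 1. \]

B. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular distribution $R(0, \theta)$, $0 < \theta < \infty$. The joint p.d.f. is

\[ f(x; \theta) = \frac{1}{\theta^n}\, I\{x_{(n)} \le \theta\}, \]

where $x_{(n)} = \max_{1 \le i \le n} \{x_i\}$. Accordingly,

\[ \frac{\partial}{\partial\theta} \log f(x; \theta) = -\frac{n}{\theta}\, I\{x_{(n)} \le \theta\} \]

and the Fisher information in the whole sample is

\[ I_n(\theta) = \frac{n^2}{\theta^2}, \quad 0 < \theta < \infty. \]

C. Let $X \sim \mu + G(1, 2)$, $-\infty < \mu < \infty$. In this case,

\[ f(x; \mu) = (x - \mu)\, e^{-(x - \mu)}\, I(x \ge \mu). \]

Thus,

\[ \left( \frac{\partial}{\partial\mu} \log f(x; \mu) \right)^2 = \left( -\frac{1}{x - \mu} + 1 \right)^2. \]

But

\[ E_\mu\left\{ \frac{1}{(X - \mu)^2} \right\} = \infty. \]

Hence, $I(\mu)$ does not exist.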
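
The closed form in part A can be confirmed by direct summation. The following minimal sketch (the function name is illustrative) evaluates the expected squared score of the binomial distribution and compares it with n/(θ(1−θ)).

```python
import math

def binomial_fisher_information(n: int, theta: float) -> float:
    """E_theta[(X/theta - (n - X)/(1 - theta))^2] for X ~ B(n, theta)."""
    total = 0.0
    for x in range(n + 1):
        pmf = math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)
        score = x / theta - (n - x) / (1 - theta)
        total += score ** 2 * pmf
    return total

n, theta = 10, 0.3
print(binomial_fisher_information(n, theta), n / (theta * (1 - theta)))
```

The two printed numbers should agree up to floating-point rounding.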


Example 3.12. In Example 3.10, we considered a four-nomial distribution with parameters $p_i(\theta)$, $i = 1, \ldots, 4$, which depend on a real parameter $\theta$, $0 < \theta < 1$. We consider two alternative ancillary statistics $A = N_1 + N_2$ and $A' = N_1 + N_4$. The question is which ancillary statistic should be used for conditional inference. Barndorff-Nielsen and Cox (1994, p. 43) recommend using the ancillary statistic which maximizes the variance of the conditional Fisher information.

A version of the log-likelihood function, conditional on $\{A = a\}$, is

\[ l(\theta \mid a) = N_1 \log(1 - \theta) + (a - N_1) \log(1 + \theta) + N_3 \log(2 - \theta) + (n - a - N_3) \log(2 + \theta). \]

This yields the conditional score function

\[ S(\theta \mid a) = \frac{\partial}{\partial\theta} l(\theta \mid a) = -\frac{N_1}{1 - \theta} + \frac{a - N_1}{1 + \theta} - \frac{N_3}{2 - \theta} + \frac{n - a - N_3}{2 + \theta}. \]

The corresponding conditional Fisher information is

\[ I(\theta \mid a) = \frac{a}{1 - \theta^2} + \frac{n - a}{4 - \theta^2}. \]

Finally, since $A \sim B\left( n, \frac{1}{3} \right)$, the Fisher information is

\[ I_n(\theta) = E\{I(\theta \mid A)\} = n\, \frac{2 - \theta^2}{(1 - \theta^2)(4 - \theta^2)}. \]

In addition,

\[ V\{I(\theta \mid A)\} = \frac{2n}{(1 - \theta^2)^2 (4 - \theta^2)^2}. \]

In a similar fashion, we can show that

\[ I(\theta \mid A') = \frac{n}{(2 - \theta)(1 + \theta)} + \frac{2\theta A'}{(1 - \theta^2)(4 - \theta^2)} \]

and

\[ V\{I(\theta \mid A')\} = \frac{n\theta^2}{(1 - \theta^2)^2 (4 - \theta^2)^2}. \]

Thus, $V\{I(\theta \mid A)\} > V\{I(\theta \mid A')\}$ for all $0 < \theta < 1$. Ancillary $A$ is preferred.
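
The comparison of the two ancillary statistics can be checked by exact summation over their binomial distributions. The sketch below is only an illustration: it assumes that, given A = a, N₁ ~ B(a, (1−θ)/2) and N₃ ~ B(n−a, (2−θ)/4), and that, given A′ = a′, N₁ ~ B(a′, (1−θ)/3) and N₂ ~ B(n−a′, (1+θ)/3), which follow from the cell probabilities of Example 3.10. Function names are hypothetical.

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def info_given_A(a: int, n: int, theta: float) -> float:
    # Conditional Fisher information given A = N1 + N2 = a.
    return a / (1 - theta ** 2) + (n - a) / (4 - theta ** 2)

def info_given_A_prime(a: int, n: int, theta: float) -> float:
    # Conditional Fisher information given A' = N1 + N4 = a.
    return a / ((1 - theta) * (2 + theta)) + (n - a) / ((1 + theta) * (2 - theta))

def mean_and_variance(info, n: int, p: float, theta: float):
    """Mean and variance of info(A, n, theta) when A ~ B(n, p)."""
    values = [info(a, n, theta) for a in range(n + 1)]
    probs = [binom_pmf(a, n, p) for a in range(n + 1)]
    mean = sum(v * q for v, q in zip(values, probs))
    var = sum((v - mean) ** 2 * q for v, q in zip(values, probs))
    return mean, var

n, theta = 12, 0.4
mean_A, var_A = mean_and_variance(info_given_A, n, 1 / 3, theta)         # A ~ B(n, 1/3)
mean_Ap, var_Ap = mean_and_variance(info_given_A_prime, n, 1 / 2, theta)  # A' ~ B(n, 1/2)
denom = (1 - theta ** 2) * (4 - theta ** 2)
print(mean_A, mean_Ap, n * (2 - theta ** 2) / denom)
print(var_A, 2 * n / denom ** 2)
print(var_Ap, n * theta ** 2 / denom ** 2)
```

Both ancillaries give the same expected information, as they must, while the variance under A is larger by the factor 2/θ².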


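Example 3.13. We provide here a few examples of the Kullback–Leibler information function.


A. Normal Distributions

Let $\mathcal{F}$ be the class of all the normal distributions $\{N(\mu, \sigma^2);\ -\infty < \mu < \infty,\ 0 < \sigma < \infty\}$. Let $\theta_1 = (\mu_1, \sigma_1)$ and $\theta_2 = (\mu_2, \sigma_2)$. We compute $I(\theta_1, \theta_2)$. The likelihood ratio is

\[ \frac{f(x; \theta_1)}{f(x; \theta_2)} = \frac{\sigma_2}{\sigma_1} \exp\left\{ -\frac{1}{2} \left[ \left( \frac{x - \mu_1}{\sigma_1} \right)^2 - \left( \frac{x - \mu_2}{\sigma_2} \right)^2 \right] \right\}. \]

Thus,

\[ \log \frac{f(x; \theta_1)}{f(x; \theta_2)} = \log\left( \frac{\sigma_2}{\sigma_1} \right) - \frac{1}{2} \left[ \left( \frac{x - \mu_1}{\sigma_1} \right)^2 - \left( \frac{x - \mu_2}{\sigma_2} \right)^2 \right]. \]

Obviously, $E_{\theta_1}\left\{ \left( \frac{X - \mu_1}{\sigma_1} \right)^2 \right\} = 1$. On the other hand

\[ E_{\theta_1}\left\{ \left( \frac{X - \mu_2}{\sigma_2} \right)^2 \right\} = \left( \frac{\sigma_1}{\sigma_2} \right)^2 E\left\{ \left( U + \frac{\mu_1 - \mu_2}{\sigma_1} \right)^2 \right\} = \frac{\sigma_1^2}{\sigma_2^2} \left( 1 + \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2} \right), \]

where $U \sim N(0, 1)$. Hence, we obtain that

\[ I(\theta_1, \theta_2) = \log\left( \frac{\sigma_2}{\sigma_1} \right) + \frac{1}{2} \left[ \left( \frac{\sigma_1}{\sigma_2} \right)^2 - 1 \right] + \frac{(\mu_1 - \mu_2)^2}{2\sigma_2^2}. \]

We see that the distance between the means contributes to the K–L information function quadratically while the contribution of the variances is through the ratio $\rho = \sigma_2/\sigma_1$.


B. Gamma Distributions
Let $\theta_i = (\lambda_i, \nu_i)$, $i = 1, 2$, and consider the ratio

\[ \frac{f(x; \theta_1)}{f(x; \theta_2)} = \frac{\Gamma(\nu_2)}{\Gamma(\nu_1)} \cdot \frac{\lambda_1^{\nu_1}}{\lambda_2^{\nu_2}}\, x^{\nu_1 - \nu_2} \exp\{-x(\lambda_1 - \lambda_2)\}. \]

We consider here two cases.

Case I: $\nu_1 = \nu_2 = \nu$. Since the $\nu$s are the same, we simplify by setting $\theta_i = \lambda_i$ ($i = 1, 2$). Accordingly,

\[ I(\lambda_1, \lambda_2) = E_{\lambda_1}\left\{ \nu \log\left( \frac{\lambda_1}{\lambda_2} \right) + (\lambda_2 - \lambda_1) X \right\} = \nu \left[ \log\left( \frac{\lambda_1}{\lambda_2} \right) + \left( \frac{\lambda_2}{\lambda_1} - 1 \right) \right]. \]

This information function depends on the scale parameters $\lambda_i$ ($i = 1, 2$), through their ratio $\rho = \lambda_2/\lambda_1$.

Case II: $\lambda_1 = \lambda_2 = \lambda$. In this case, we write

\[ I(\nu_1, \nu_2) = \log \frac{\Gamma(\nu_2)}{\Gamma(\nu_1)} - (\nu_2 - \nu_1) \log\lambda + E_{\nu_1}\{(\nu_1 - \nu_2) \log X\}. \]

Furthermore,

\[ E_{\nu_1}\{\log X\} = \frac{\lambda^{\nu_1}}{\Gamma(\nu_1)} \int_0^\infty (\log x)\, x^{\nu_1 - 1} e^{-\lambda x}\, dx = \frac{d}{d\nu_1} \log\Gamma(\nu_1) - \log\lambda. \]

The derivative of the log-gamma function is tabulated (Abramowitz and Stegun, 1968).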
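
The closed form for the normal case can be compared with a direct numerical evaluation of E₁{log(f₁/f₂)}. The following is a minimal sketch using a trapezoidal rule. The function names, integration width, and grid size are illustrative choices, not part of the text.

```python
import math

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kl_normal_closed_form(mu1, s1, mu2, s2) -> float:
    """I(theta1, theta2) = log(s2/s1) + ((s1/s2)^2 - 1)/2 + (mu1 - mu2)^2/(2 s2^2)."""
    return math.log(s2 / s1) + 0.5 * ((s1 / s2) ** 2 - 1.0) + (mu1 - mu2) ** 2 / (2.0 * s2 ** 2)

def kl_normal_numeric(mu1, s1, mu2, s2, half_width=12.0, steps=20000) -> float:
    """Trapezoidal approximation of the integral of f1 * log(f1/f2)."""
    lo, hi = mu1 - half_width * s1, mu1 + half_width * s1
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        # The log-ratio is written analytically to avoid log(0) in the tails.
        log_ratio = math.log(s2 / s1) - 0.5 * (((x - mu1) / s1) ** 2 - ((x - mu2) / s2) ** 2)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * normal_pdf(x, mu1, s1) * log_ratio
    return total * h

print(kl_normal_closed_form(0.0, 1.0, 1.0, 2.0), kl_normal_numeric(0.0, 1.0, 1.0, 2.0))
```

The information function is not symmetric: exchanging the roles of θ₁ and θ₂ generally gives a different value.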
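Example 3.14. Consider the normal distribution $N(\mu, \sigma^2)$; $-\infty < \mu < \infty$, $0 < \sigma^2 < \infty$. The score vector, with respect to $\theta = (\mu, \sigma^2)$, is

\[ S(\theta; X) = \begin{pmatrix} \dfrac{X - \mu}{\sigma^2} \\[2mm] -\dfrac{1}{2\sigma^2} + \dfrac{(X - \mu)^2}{2\sigma^4} \end{pmatrix}. \]

Thus, the FIM is

\[ I(\mu, \sigma^2) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[2mm] 0 & \dfrac{1}{2\sigma^4} \end{pmatrix}. \]

We have seen in Example 2.16 that this distribution is a two-parameter exponential type, with canonical parameters $\psi_1 = \dfrac{\mu}{\sigma^2}$ and $\psi_2 = -\dfrac{1}{2\sigma^2}$. Making the reparametrization in terms of $\psi_1$ and $\psi_2$, we compute the FIM as a function of $\psi_1, \psi_2$. The inverse transformation is

\[ \mu = -\frac{\psi_1}{2\psi_2}, \qquad \sigma^2 = -\frac{1}{2\psi_2}. \]

Thus,

\[ D(\psi) = \begin{pmatrix} -\dfrac{1}{2\psi_2} & 0 \\[2mm] \dfrac{\psi_1}{2\psi_2^2} & \dfrac{1}{2\psi_2^2} \end{pmatrix}, \]

where row $i$ contains the partial derivatives of $(\mu, \sigma^2)$ with respect to $\psi_i$. Substituting (3.8.9) into (3.8.8) and applying (3.8.7) we obtain

\[ I(\psi_1, \psi_2) = D(\psi) \begin{pmatrix} -2\psi_2 & 0 \\ 0 & 2\psi_2^2 \end{pmatrix} D'(\psi) = \begin{pmatrix} -\dfrac{1}{2\psi_2} & \dfrac{\psi_1}{2\psi_2^2} \\[2mm] \dfrac{\psi_1}{2\psi_2^2} & \dfrac{\psi_2 - \psi_1^2}{2\psi_2^3} \end{pmatrix}. \]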

2 
Notice that py? — 4y3 = S PG 
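
The reparametrization formula (3.8.7) can be cross-checked in this example against the Hessian of the cumulant function. The sketch below assumes that, up to an additive constant, K(ψ₁, ψ₂) = −ψ₁²/(4ψ₂) − ½ log(−2ψ₂) for the normal family in canonical form. The helper names are illustrative.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def fim_theta(sigma2: float):
    """FIM with respect to (mu, sigma^2)."""
    return [[1.0 / sigma2, 0.0], [0.0, 1.0 / (2.0 * sigma2 ** 2)]]

def jacobian(psi1: float, psi2: float):
    """Row i holds the derivatives of (mu, sigma^2) with respect to psi_i."""
    return [[-1.0 / (2.0 * psi2), 0.0],
            [psi1 / (2.0 * psi2 ** 2), 1.0 / (2.0 * psi2 ** 2)]]

def fim_psi(psi1: float, psi2: float):
    """I(psi) = D I(theta(psi)) D', as in (3.8.7)."""
    sigma2 = -1.0 / (2.0 * psi2)
    d = jacobian(psi1, psi2)
    return matmul(matmul(d, fim_theta(sigma2)), transpose(d))

def hessian_K(psi1: float, psi2: float):
    """Hessian of K(psi) = -psi1^2/(4 psi2) - 0.5*log(-2 psi2)."""
    off = psi1 / (2.0 * psi2 ** 2)
    return [[-1.0 / (2.0 * psi2), off],
            [off, -psi1 ** 2 / (2.0 * psi2 ** 3) + 1.0 / (2.0 * psi2 ** 2)]]

mu, sigma2 = 1.0, 2.0
psi1, psi2 = mu / sigma2, -1.0 / (2.0 * sigma2)
print(fim_psi(psi1, psi2))
print(hessian_K(psi1, psi2))
```

The two matrices should coincide. The nonzero off-diagonal entry shows that ψ₁ and ψ₂ are not orthogonal unless μ = 0.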
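Example 3.15. Let $(X, Y)$ have the bivariate normal distribution $N\left( 0, \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)$, $0 < \sigma^2 < \infty$, $-1 < \rho < 1$. This is a two-parameter exponential type family with

\[ f(x, y) = \frac{1}{2\pi} \exp\{ \psi_1 U_1(x, y) + \psi_2 U_2(x, y) - K(\psi_1, \psi_2) \}, \]

where $U_1(x, y) = x^2 + y^2$, $U_2(x, y) = xy$ and

\[ K(\psi_1, \psi_2) = -\frac{1}{2} \log(4\psi_1^2 - \psi_2^2). \]

The Hessian of $K(\psi_1, \psi_2)$ is the FIM, with respect to the canonical parameters. We obtain

\[ I(\psi_1, \psi_2) = \frac{1}{(4\psi_1^2 - \psi_2^2)^2} \begin{pmatrix} 4(4\psi_1^2 + \psi_2^2) & -8\psi_1\psi_2 \\ -8\psi_1\psi_2 & 4\psi_1^2 + \psi_2^2 \end{pmatrix}. \]

Using the reparametrization formula, with $\psi_1 = -\dfrac{1}{2\sigma^2(1 - \rho^2)}$ and $\psi_2 = \dfrac{\rho}{\sigma^2(1 - \rho^2)}$, we get

\[ I(\sigma^2, \rho) = \begin{pmatrix} \dfrac{1}{\sigma^4} & -\dfrac{\rho}{\sigma^2(1 - \rho^2)} \\[2mm] -\dfrac{\rho}{\sigma^2(1 - \rho^2)} & \dfrac{1 + \rho^2}{(1 - \rho^2)^2} \end{pmatrix}. \]

Notice that in this example neither $\psi_1, \psi_2$ nor $\sigma^2, \rho$ are orthogonal parameters.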
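
The passage from the canonical FIM to I(σ², ρ) can be verified numerically. This sketch assumes the canonical parameters ψ₁ = −1/(2σ²(1−ρ²)) and ψ₂ = ρ/(σ²(1−ρ²)) implied by the density of Example 3.15, and applies the reparametrization rule in the form J′HJ, where J is the matrix of derivatives of (ψ₁, ψ₂) with respect to (σ², ρ). All names are illustrative.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def hessian_K(psi1: float, psi2: float):
    """Hessian of K = -0.5*log(4 psi1^2 - psi2^2): FIM in canonical form."""
    d = 4.0 * psi1 ** 2 - psi2 ** 2
    s = 4.0 * psi1 ** 2 + psi2 ** 2
    return [[4.0 * s / d ** 2, -8.0 * psi1 * psi2 / d ** 2],
            [-8.0 * psi1 * psi2 / d ** 2, s / d ** 2]]

def jacobian(sigma2: float, rho: float):
    """Rows: psi1, psi2. Columns: d/d(sigma^2), d/d(rho)."""
    q = 1.0 - rho ** 2
    return [[1.0 / (2.0 * sigma2 ** 2 * q), -rho / (sigma2 * q ** 2)],
            [-rho / (sigma2 ** 2 * q), (1.0 + rho ** 2) / (sigma2 * q ** 2)]]

def fim_sigma2_rho(sigma2: float, rho: float):
    q = 1.0 - rho ** 2
    psi1, psi2 = -1.0 / (2.0 * sigma2 * q), rho / (sigma2 * q)
    j = jacobian(sigma2, rho)
    return matmul(matmul(transpose(j), hessian_K(psi1, psi2)), j)

def fim_closed_form(sigma2: float, rho: float):
    q = 1.0 - rho ** 2
    off = -rho / (sigma2 * q)
    return [[1.0 / sigma2 ** 2, off], [off, (1.0 + rho ** 2) / q ** 2]]

print(fim_sigma2_rho(1.5, 0.4))
print(fim_closed_form(1.5, 0.4))
```

The off-diagonal entry comes out negative for ρ > 0, and it vanishes only at ρ = 0, so σ² and ρ are orthogonal only in that special case.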


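Example 3.16. Let $\mathcal{F} = \{E(\lambda),\ 0 < \lambda < \infty\}$.

\[ \rho^2(\lambda_1, \lambda_2) = 2 \left( 1 - \frac{2\sqrt{\lambda_1 \lambda_2}}{\lambda_1 + \lambda_2} \right). \]

Notice that $2\sqrt{\lambda_1 \lambda_2} \le \lambda_1 + \lambda_2$ for all $0 < \lambda_1, \lambda_2 < \infty$. If $\lambda_1 = \lambda_2$ then $\rho(\lambda_1, \lambda_2) = 0$. On the other hand, $0 \le \rho^2(\lambda_1, \lambda_2) < 2$ for all $0 < \lambda_1, \lambda_2 < \infty$. However, for $\lambda_1$ fixed

\[ \lim_{\lambda_2 \to \infty} \rho^2(\lambda_1, \lambda_2) = 2. \]
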
PART III: PROBLEMS 
Section 3.2 
3.2.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common rectangular distribution $R(\theta_1, \theta_2)$, $-\infty < \theta_1 < \theta_2 < \infty$.

(i) Apply the Factorization Theorem to prove that $X_{(1)} = \min\{X_i\}$ and $X_{(n)} = \max\{X_i\}$ are sufficient statistics.

(ii) Derive the conditional p.d.f. of $X = (X_1, \ldots, X_n)$ given $(X_{(1)}, X_{(n)})$.

3.2.2 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a two-parameter exponential distribution, i.e., $X \sim \mu + E(\lambda)$, $-\infty < \mu < \infty$, $0 < \lambda < \infty$. Let $X_{(1)} \le \cdots \le X_{(n)}$ be the order statistic.

(i) Apply the Factorization Theorem to prove that $X_{(1)}$ and $S = \sum_{i=2}^n (X_{(i)} - X_{(1)})$ are sufficient statistics.
(ii) Derive the conditional p.d.f. of $X$ given $(X_{(1)}, S)$.
(iii) How would you generate an equivalent sample $X'$ (by simulation) when the values of $(X_{(1)}, S)$ are given?


3.2.3 Consider the linear regression model (Problem 3, Section 2.9). The unknown parameters are $(\alpha, \beta, \sigma)$. What is a sufficient statistic for $\mathcal{F}$?


3.2.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a Laplace distribution with p.d.f.

\[ f(x; \mu, \sigma) = \frac{1}{2\sigma} \exp\left\{ -\frac{|x - \mu|}{\sigma} \right\}, \quad -\infty < x < \infty;\ -\infty < \mu < \infty; \]

$0 < \sigma < \infty$. What is a sufficient statistic for $\mathcal{F}$
(i) when $\mu$ is known?
(ii) when $\mu$ is unknown?


Section 3.3 
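3.3.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common Cauchy distribution with p.d.f.

\[ f(x; \mu, \sigma) = \frac{1}{\sigma\pi} \left[ 1 + \left( \frac{x - \mu}{\sigma} \right)^2 \right]^{-1}, \quad -\infty < x < \infty; \]

$-\infty < \mu < \infty$, $0 < \sigma < \infty$. What is an m.s.s. for $\mathcal{F}$?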


3.3.2 Let X,,..., X, be iid. random variables with a distribution belonging to a 
family F of contaminated normal distributions, having p.d.f.s, 


f(x;a, wo) = (1 — a) 


$-\infty < \mu < \infty$; $0 < \sigma < \infty$; $0 < \alpha < 10^{-2}$. What is an m.s.s. for $\mathcal{F}$?
3.3.3 Let $X_1, \ldots, X_n$ be i.i.d. having a common distribution belonging to the family $\mathcal{F}$ of all location and scale parameter beta distributions, having the p.d.f.s

\[ f(x; \mu, \sigma, p, q) = \frac{1}{\sigma B(p, q)} \left( \frac{x - \mu}{\sigma} \right)^{p-1} \left( 1 - \frac{x - \mu}{\sigma} \right)^{q-1}, \]

$\mu \le x \le \mu + \sigma$; $-\infty < \mu < \infty$; $0 < \sigma < \infty$; $0 < p, q < \infty$.
(i) What is an m.s.s. when all the four parameters are unknown?
(ii) What is an m.s.s. when $p, q$ are known?
(iii) What is an m.s.s. when $\mu, \sigma$ are known?

3.3.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular $R(\theta_1, \theta_2)$, $-\infty < \theta_1 < \theta_2 < \infty$. What is an m.s.s.?

3.3.5 $\mathcal{F}$ is a family of joint distributions of $(X, Y)$ with p.d.f.s

\[ g(x, y; \lambda) = \exp\{-\lambda x - y/\lambda\}, \quad 0 < \lambda < \infty. \]

Given a sample of $n$ i.i.d. random vectors $(X_i, Y_i)$, $i = 1, \ldots, n$, what is an m.s.s. for $\mathcal{F}$?

3.3.6 The following is a model in population genetics, called the Hardy–Weinberg model. The frequencies $N_1, N_2, N_3$, $\sum_{i=1}^3 N_i = n$, of three genotypes among $n$ individuals have a distribution belonging to the family $\mathcal{F}$ of trinomial distributions with parameters $(n, p_1(\theta), p_2(\theta), p_3(\theta))$, where

\[ p_1(\theta) = \theta^2, \quad p_2(\theta) = 2\theta(1 - \theta), \quad p_3(\theta) = (1 - \theta)^2, \tag{3.3.1} \]

$0 < \theta < 1$. What is an m.s.s. for $\mathcal{F}$?
Section 3.4
Section 3.4 

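3.4.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common distribution with p.d.f.

\[ f(x; \psi) = I\{\psi_1 \le x \le \psi_2\}\, h(x) \exp\left\{ \sum_{i=3}^k \psi_i U_i(x) - K(\psi) \right\}, \]

$-\infty < \psi_1 < \psi_2 < \infty$. Prove that $T(X) = \left( X_{(1)}, X_{(n)}, \sum_{i=1}^n U_3(X_i), \ldots, \sum_{i=1}^n U_k(X_i) \right)$ is an m.s.s.

3.4.2 Let $\{(X_i, Y_i),\ i = 1, \ldots, n\}$ be i.i.d. random vectors having a common bivariate normal distribution

\[ N\left( \begin{pmatrix} \xi \\ \eta \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix} \right), \]

where $-\infty < \xi, \eta < \infty$; $0 < \sigma_x^2, \sigma_y^2 < \infty$; $-1 < \rho < 1$.
(i) Write the p.d.f. in canonical form.

(ii) What is the m.s.s. for $\mathcal{F}$?


3.4.3 In continuation of the previous problem, what is the m.s.s.
(i) when $\xi = \eta = 0$?
(ii) when $\sigma_x = \sigma_y = 1$?

(iii) when $\xi = \eta = 0$, $\sigma_x = \sigma_y = 1$?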


Section 3.5 


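3.5.1 Let $\mathcal{F} = \{G^{1/\alpha}(\lambda, 1);\ 0 < \alpha < \infty,\ 0 < \lambda < \infty\}$ be the family of Weibull distributions. Is $\mathcal{F}$ complete?


3.5.2 Let $\mathcal{F}$ be the family of extreme-values distributions. Is $\mathcal{F}$ complete?


3.5.3 Let $\mathcal{F} = \{R(\theta_1, \theta_2);\ -\infty < \theta_1 < \theta_2 < \infty\}$. Let $X_1, X_2, \ldots, X_n$, $n \ge 2$, be a random sample from a distribution of $\mathcal{F}$. Is the m.s.s. complete?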


3.5.4 Is the family of trinomial distributions complete? 


3.5.5 Show that for the Hardy-Weinberg model the m.s.s. is complete. 


Section 3.6 
3.6.1 Let $\{(X_i, Y_i),\ i = 1, \ldots, n\}$ be i.i.d. random vectors distributed like $N\left( 0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)$, $-1 \le \rho \le 1$.
(i) Show that the random vectors $X$ and $Y$ are ancillary statistics.

(ii) What is an m.s.s. based on the conditional distribution of $Y$ given $\{X = x\}$?

3.6.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a normal distribution $N(\mu, \sigma^2)$, where both $\mu$ and $\sigma$ are unknown.
(i) Show that $U(X) = \dfrac{M_e - \bar{X}}{Q_3 - Q_1}$ is ancillary, where $M_e$ is the sample median; $\bar{X}$ is the sample mean; $Q_1$ and $Q_3$ are the sample first and third quartiles.
(ii) Prove that $U(X)$ is independent of $|\bar{X}|/S$, where $S$ is the sample standard deviation.

Section 3.7 


3.7.1 Consider the one-parameter exponential family with p.d.f.s

\[ f(x; \theta) = h(x) \exp\{U(x)\psi(\theta) - K(\theta)\}. \]

Show that the Fisher information function for $\theta$ is

\[ I(\theta) = K''(\theta) - \psi''(\theta) \frac{K'(\theta)}{\psi'(\theta)}. \]

Check this result specifically for the Binomial, Poisson, and Negative-Binomial distributions.


3.7.2 Let $(X_i, Y_i)$, $i = 1, \ldots, n$ be i.i.d. vectors having the bivariate standard normal distribution with unknown coefficient of correlation $\rho$, $-1 \le \rho \le 1$. Derive the Fisher information function $I_n(\rho)$.


3.7.3 Let $\phi(x)$ denote the p.d.f. of $N(0, 1)$. Define the family of mixtures

\[ f(x; \alpha) = \alpha\phi(x) + (1 - \alpha)\phi(x - 1), \quad 0 \le \alpha \le 1. \]

Derive the Fisher information function $I(\alpha)$.


3.7.4 Let $\mathcal{F} = \{f(x; \psi),\ -\infty < \psi < \infty\}$ be a one-parameter exponential family, where the canonical p.d.f. is

\[ f(x; \psi) = h(x) \exp\{\psi U(x) - K(\psi)\}. \]

(i) Show that the Fisher information function is

\[ I(\psi) = K''(\psi). \]

(ii) Derive this Fisher information for the Binomial and Poisson distributions.


3.7.5 Let $X_1, \ldots, X_n$ be i.i.d. $N(0, \sigma^2)$, $0 < \sigma^2 < \infty$.
(i) What is the m.s.s. $T$?
(ii) Derive the Fisher information $I(\sigma^2)$ from the distribution of $T$.
(iii) Derive the Fisher information $I^{S^2}(\sigma^2)$, where $S^2 = \dfrac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X})^2$ is the sample variance. Show that $I^{S^2}(\sigma^2) < I(\sigma^2)$.

3.7.6 Let $(X, Y)$ have the bivariate standard normal distribution $N\left( 0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)$, $-1 < \rho < 1$. $X$ is an ancillary statistic. Derive the conditional Fisher information $I(\rho \mid X)$ and then the Fisher information $I(\rho)$.


3.7.7 Consider the model of Problem 6. What is the Kullback–Leibler information function $I(\rho_1, \rho_2)$ for discriminating between $\rho_1$ and $\rho_2$ where $-1 \le \rho_1 < \rho_2 \le 1$?


3.7.8 Let $X \sim P(\lambda)$. Derive the Kullback–Leibler information $I(\lambda_1, \lambda_2)$ for $0 < \lambda_1, \lambda_2 < \infty$.


3.7.9 Let $X \sim B(n, \theta)$. Derive the Kullback–Leibler information function $I(\theta_1, \theta_2)$, $0 < \theta_1, \theta_2 < 1$.


3.7.10 Let $X \sim G(\lambda, \nu)$, $0 < \lambda < \infty$, $\nu$ known.
(i) Express the p.d.f. of $X$ as a one-parameter canonical exponential type density, $g(x; \psi)$.
(ii) Find $\hat\psi$ for which $g(x; \psi)$ is maximal.
(iii) Find the Kullback–Leibler information function $I(\hat\psi, \psi)$ and show that

\[ \frac{\partial^2}{\partial\psi^2} I(\hat\psi, \psi) = I(\psi) = \frac{\nu}{\psi^2}. \]

Section 3.8 


3.8.1 Consider the trinomial distribution $M(n, p_1, p_2)$, $0 < p_1, p_2$, $p_1 + p_2 < 1$.
(i) Show that the FIM is

\[ I(p_1, p_2) = \frac{n}{1 - p_1 - p_2} \begin{pmatrix} \dfrac{1 - p_2}{p_1} & 1 \\[2mm] 1 & \dfrac{1 - p_1}{p_2} \end{pmatrix}. \]

(ii) For the Hardy–Weinberg model, $p_1(\theta) = \theta^2$, $p_2(\theta) = 2\theta(1 - \theta)$, derive the Fisher information function

\[ I(\theta) = \frac{2n}{\theta(1 - \theta)}. \]


3.8.2 Consider the bivariate normal distribution. Derive the FIM $I(\xi, \eta, \sigma_1, \sigma_2, \rho)$.


3.8.3 Consider the gamma distribution $G(\lambda, \nu)$. Derive the FIM $I(\lambda, \nu)$.

3.8.4 Consider the Weibull distribution $W(\lambda, \alpha) \sim (G(\lambda, 1))^{1/\alpha}$; $0 < \alpha, \lambda < \infty$. Derive the Fisher information matrix $I(\lambda, \alpha)$.

Section 3.9 


3.9.1 Find the Hellinger distance between two Poisson distributions with parameters $\lambda_1$ and $\lambda_2$.


3.9.2 Find the Hellinger distance between two Binomial distributions with parameters $p_1 \ne p_2$ and the same parameter $n$.


3.9.3 Show that for the Poisson and the Binomial distributions Equation (3.9.4) holds.


PART IV: SOLUTIONS TO SELECTED PROBLEMS
3.2.1 $X_1, \ldots, X_n$ are i.i.d. $\sim R(\theta_1, \theta_2)$, $-\infty < \theta_1 < \theta_2 < \infty$.
(i)

\[ f(X_1, \ldots, X_n; \theta) = \frac{1}{(\theta_2 - \theta_1)^n} \prod_{i=1}^n I(\theta_1 < X_i < \theta_2) = \frac{1}{(\theta_2 - \theta_1)^n}\, I(\theta_1 < X_{(1)} < X_{(n)} < \theta_2). \]

Thus, $f(X_1, \ldots, X_n; \theta) = h(x)\, g(T(x), \theta)$, where $h(x) = 1\ \forall x$ and

\[ g(T(x); \theta) = \frac{1}{(\theta_2 - \theta_1)^n}\, I(\theta_1 < X_{(1)} < X_{(n)} < \theta_2). \]

$T(X) = (X_{(1)}, X_{(n)})$ is a likelihood statistic and thus minimal sufficient.
(ii) The p.d.f. of $(X_{(1)}, X_{(n)})$ is

\[ h(x, y) = n(n - 1) \frac{(y - x)^{n-2}}{(\theta_2 - \theta_1)^n}\, I(\theta_1 < x < y < \theta_2). \]

Let $(X_{(1)}, \ldots, X_{(n)})$ be the order statistic. The p.d.f. of $(X_{(1)}, \ldots, X_{(n)})$ is

\[ p(x_1, \ldots, x_n; \theta) = \frac{n!}{(\theta_2 - \theta_1)^n}\, I(\theta_1 < x_1 < \cdots < x_n < \theta_2). \]

The conditional p.d.f. of $(X_{(1)}, \ldots, X_{(n)})$ given $(X_{(1)}, X_{(n)})$ is

\[ \frac{p(x_1, \ldots, x_n; \theta)}{h(x_1, x_n; \theta)} = \frac{(n - 2)!}{(x_n - x_1)^{n-2}}\, I(x_1 < x_2 < \cdots < x_{n-1} < x_n). \]

That is, $(X_{(2)}, \ldots, X_{(n-1)})$ given $(X_{(1)}, X_{(n)})$ are distributed like the $(n - 2)$ order statistic of $(n - 2)$ i.i.d. from $R(X_{(1)}, X_{(n)})$.


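3.3.6 The likelihood function of $\theta$, $0 < \theta < 1$, is

\[ L(\theta; n, N_1, N_2) = \theta^{2N_1 + N_2} (1 - \theta)^{2N_3 + N_2}. \]

Since $N_3 = n - N_1 - N_2$, $2N_3 + N_2 = 2n - 2N_1 - N_2$. Hence,

\[ L(\theta; n, N_1, N_2) = \left( \frac{\theta}{1 - \theta} \right)^{2N_1 + N_2} (1 - \theta)^{2n}. \]

The sample size $n$ is known. Thus, the m.s.s. is $T_n = 2N_1 + N_2$.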


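3.4.1

\[ f(x; \psi) = I\{\psi_1 \le x \le \psi_2\}\, h(x) \exp\left\{ \sum_{i=3}^k \psi_i U_i(x) - K(\psi) \right\}. \]

The likelihood function of $\psi$ is

\[ L(\psi; X) = I\{\psi_1 \le X_{(1)} < X_{(n)} \le \psi_2\} \cdot \exp\left\{ \sum_{i=3}^k \psi_i \sum_{j=1}^n U_i(X_j) - nK(\psi) \right\}. \]

Thus $\left( X_{(1)}, X_{(n)}, \sum_{j=1}^n U_3(X_j), \ldots, \sum_{j=1}^n U_k(X_j) \right)$ is a likelihood statistic, i.e., minimal sufficient.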


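3.4.2 The joint p.d.f. of $(X_i, Y_i)$, $i = 1, \ldots, n$ is

\[ f(X, Y; \theta) = \frac{1}{(2\pi)^n \sigma_1^n \sigma_2^n (1 - \rho^2)^{n/2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \sum_{i=1}^n \left[ \left( \frac{X_i - \xi}{\sigma_1} \right)^2 - 2\rho\, \frac{X_i - \xi}{\sigma_1} \cdot \frac{Y_i - \eta}{\sigma_2} + \left( \frac{Y_i - \eta}{\sigma_2} \right)^2 \right] \right\}. \]

In canonical form, the joint density is

\[ f(X, Y; \psi) = \frac{1}{(2\pi)^n} \exp\left\{ \psi_1 \sum_{i=1}^n X_i + \psi_2 \sum_{i=1}^n Y_i + \psi_3 \sum_{i=1}^n X_i^2 + \psi_4 \sum_{i=1}^n Y_i^2 + \psi_5 \sum_{i=1}^n X_i Y_i - nK(\psi) \right\}, \]

where

\[ \psi_1 = \frac{\xi}{\sigma_1^2(1 - \rho^2)} - \frac{\eta\rho}{\sigma_1\sigma_2(1 - \rho^2)}, \qquad \psi_2 = \frac{\eta}{\sigma_2^2(1 - \rho^2)} - \frac{\xi\rho}{\sigma_1\sigma_2(1 - \rho^2)}, \]
\[ \psi_3 = -\frac{1}{2\sigma_1^2(1 - \rho^2)}, \qquad \psi_4 = -\frac{1}{2\sigma_2^2(1 - \rho^2)}, \qquad \psi_5 = \frac{\rho}{\sigma_1\sigma_2(1 - \rho^2)}, \]

and

\[ K(\psi) = \frac{\xi^2\sigma_2^2 + \eta^2\sigma_1^2 - 2\sigma_1\sigma_2\rho\xi\eta}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} + \frac{1}{2} \log(\sigma_1^2\sigma_2^2(1 - \rho^2)). \]

The m.s.s. for $\mathcal{F}$ is $T(X, Y) = (\sum X_i, \sum Y_i, \sum X_i^2, \sum Y_i^2, \sum X_i Y_i)$.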
3.5.5 We have seen that the likelihood of $\theta$ is $L(\theta) \propto \theta^{T(N)} (1 - \theta)^{2n - T(N)}$ where $T(N) = 2N_1 + N_2$. This is the m.s.s. Thus, the distribution of $T(N)$ is $B(2n, \theta)$. Finally $\mathcal{F} = \{B(2n, \theta),\ 0 < \theta < 1\}$ is complete.


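3.6.2
(i)

\[ M_e \sim \mu + \sigma M_e(Z), \quad \bar{X} \sim \mu + \sigma\bar{Z}, \quad Q_3 \sim \mu + \sigma Q_3(Z), \quad Q_1 \sim \mu + \sigma Q_1(Z), \]

\[ U(X) = \frac{M_e - \bar{X}}{Q_3 - Q_1} \sim \frac{M_e(Z) - \bar{Z}}{Q_3(Z) - Q_1(Z)}, \]

independent of $\mu$ and $\sigma$.

(ii) By Basu's Theorem, $U(X)$ is independent of $(\bar{X}, S)$, which is a complete sufficient statistic. Hence, $U(X)$ is independent of $|\bar{X}|/S$.


3.7.1 The score function is

\[ S(\theta; X) = U(X)\psi'(\theta) - K'(\theta). \]

Hence, the Fisher information is

\[ I(\theta) = V_\theta\{S(\theta; X)\} = (\psi'(\theta))^2\, V_\theta\{U(X)\}. \]

Consider the equation

\[ \int h(x)\, e^{\psi(\theta)U(x) - K(\theta)}\, dx = 1. \]

Differentiating both sides of this equation with respect to $\theta$, we obtain that

\[ E_\theta\{U(X)\} = \frac{K'(\theta)}{\psi'(\theta)}. \]

Differentiating the above equation twice with respect to $\theta$ yields

\[ (\psi'(\theta))^2 E_\theta\{U^2(X)\} = (K'(\theta))^2 - \psi''(\theta) \frac{K'(\theta)}{\psi'(\theta)} + K''(\theta). \]

Thus, we get

\[ V_\theta\{U(X)\} = \frac{K''(\theta)}{(\psi'(\theta))^2} - \psi''(\theta) \frac{K'(\theta)}{(\psi'(\theta))^3}. \]

Therefore

\[ I(\theta) = K''(\theta) - \psi''(\theta) \frac{K'(\theta)}{\psi'(\theta)}. \]

In the Binomial case,

\[ f(x; \theta) = \binom{n}{x} e^{x \log\frac{\theta}{1 - \theta} + n \log(1 - \theta)}. \]

Thus, $\psi(\theta) = \log\dfrac{\theta}{1 - \theta}$, $K(\theta) = -n \log(1 - \theta)$,

\[ \psi'(\theta) = \frac{1}{\theta(1 - \theta)}, \quad \psi''(\theta) = \frac{2\theta - 1}{\theta^2(1 - \theta)^2}, \quad K'(\theta) = \frac{n}{1 - \theta}, \quad K''(\theta) = \frac{n}{(1 - \theta)^2}. \]

Hence, $I(\theta) = \dfrac{n}{\theta(1 - \theta)}$.

In the Poisson case,

\[ \psi(\lambda) = \log(\lambda), \quad K(\lambda) = \lambda, \]
\[ \psi'(\lambda) = \frac{1}{\lambda}, \quad \psi''(\lambda) = -\frac{1}{\lambda^2}, \quad K'(\lambda) = 1, \quad K''(\lambda) = 0, \]
\[ I(\lambda) = \frac{1}{\lambda}, \quad I_n(\lambda) = \frac{n}{\lambda}. \]

In the Negative-Binomial case, $\nu$ known,

\[ \psi(p) = \log(1 - p), \quad K(p) = -\nu\log(p), \quad I(p) = \frac{\nu}{p^2(1 - p)}. \]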
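
The three special cases can be checked by plugging the derivatives into the general formula. The sketch below is only an illustration: the helper function and its argument names are hypothetical, and the derivatives are entered by hand for the parametrizations ψ(θ), K(θ) used in this solution.

```python
def fisher_info(dpsi: float, d2psi: float, dK: float, d2K: float) -> float:
    """I(theta) = K''(theta) - psi''(theta) * K'(theta) / psi'(theta)."""
    return d2K - d2psi * dK / dpsi

def binomial_info(n: int, theta: float) -> float:
    # psi = log(theta/(1-theta)), K = -n*log(1-theta)
    return fisher_info(1 / (theta * (1 - theta)),
                       (2 * theta - 1) / (theta ** 2 * (1 - theta) ** 2),
                       n / (1 - theta),
                       n / (1 - theta) ** 2)

def poisson_info(lam: float) -> float:
    # psi = log(lam), K = lam
    return fisher_info(1 / lam, -1 / lam ** 2, 1.0, 0.0)

def negative_binomial_info(nu: float, p: float) -> float:
    # psi = log(1-p), K = -nu*log(p)
    return fisher_info(-1 / (1 - p), -1 / (1 - p) ** 2, -nu / p, nu / p ** 2)

print(binomial_info(10, 0.3), 10 / (0.3 * 0.7))
print(poisson_info(2.0), 1 / 2.0)
print(negative_binomial_info(3.0, 0.4), 3.0 / (0.4 ** 2 * 0.6))
```

Each pair of printed values should agree, matching n/(θ(1−θ)), 1/λ, and ν/(p²(1−p)) respectively.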


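3.7.2 Let $Q_X = \sum_{i=1}^n X_i^2$, $P_{XY} = \sum_{i=1}^n X_i Y_i$, $Q_Y = \sum_{i=1}^n Y_i^2$. Let $l(\rho)$ denote the log-likelihood function of $\rho$, $-1 < \rho < 1$. This is

\[ l(\rho) = -\frac{n}{2} \log(1 - \rho^2) - \frac{1}{2(1 - \rho^2)} (Q_X - 2\rho P_{XY} + Q_Y). \]

Furthermore,

\[ l''(\rho) = \frac{n(1 - \rho^4)}{(1 - \rho^2)^3} - \frac{(Q_X + Q_Y)(1 + 3\rho^2)}{(1 - \rho^2)^3} + \frac{2P_{XY}\, \rho(3 + \rho^2)}{(1 - \rho^2)^3}. \]

Recall that $I(\rho) = E\{-l''(\rho)\}$. Moreover,

\[ E(Q_X + Q_Y) = 2n \quad \text{and} \quad E(P_{XY}) = n\rho. \]

Thus,

\[ I_n(\rho) = \frac{n(1 - \rho^4)}{(1 - \rho^2)^3} = \frac{n(1 + \rho^2)}{(1 - \rho^2)^2}. \]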
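3.7.7

\[ (X, Y) \sim N\left( 0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \quad -1 < \rho < 1, \]

\[ f(x, y; \rho) = \frac{1}{2\pi} \exp\left( -\frac{1}{2} x^2 \right) \frac{1}{(1 - \rho^2)^{1/2}} \exp\left( -\frac{1}{2(1 - \rho^2)} (y - \rho x)^2 \right), \]

\[ \frac{f(x, y; \rho_1)}{f(x, y; \rho_2)} = \frac{(1 - \rho_2^2)^{1/2}}{(1 - \rho_1^2)^{1/2}} \exp\left\{ -\frac{1}{2} \left( \frac{(y - \rho_1 x)^2}{1 - \rho_1^2} - \frac{(y - \rho_2 x)^2}{1 - \rho_2^2} \right) \right\}, \]

\[ \log \frac{f(x, y; \rho_1)}{f(x, y; \rho_2)} = \frac{1}{2} \log\left( \frac{1 - \rho_2^2}{1 - \rho_1^2} \right) - \frac{1}{2} \frac{(y - \rho_1 x)^2}{1 - \rho_1^2} + \frac{1}{2} \frac{(y - \rho_2 x)^2}{1 - \rho_2^2}. \]

Under $\rho_1$, $E\{(Y - \rho_1 X)^2\} = 1 - \rho_1^2$ and $E\{(Y - \rho_2 X)^2\} = 1 - 2\rho_1\rho_2 + \rho_2^2$. Thus, the Kullback–Leibler information is

\[ I(\rho_1, \rho_2) = E_{\rho_1}\left\{ \log \frac{f(X, Y; \rho_1)}{f(X, Y; \rho_2)} \right\} = \frac{1}{2} \log\left( \frac{1 - \rho_2^2}{1 - \rho_1^2} \right) + \frac{\rho_2(\rho_2 - \rho_1)}{1 - \rho_2^2}. \]

The formula is good also for $\rho_1 > \rho_2$.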


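3.7.10 The p.d.f. of $G(\lambda, \nu)$ is

\[ f(x; \lambda, \nu) = \frac{\lambda^\nu}{\Gamma(\nu)} x^{\nu - 1} e^{-\lambda x}. \]

When $\nu$ is known, we can write the p.d.f. as $g(x; \psi) = h(x) \exp(\psi x + \nu\log(-\psi))$, where $h(x) = \dfrac{x^{\nu - 1}}{\Gamma(\nu)}$, $\psi = -\lambda$. The value of $\psi$ maximizing $g(x, \psi)$ is $\hat\psi = -\dfrac{\nu}{x}$. The K–L information $I(\psi_1, \psi)$ is

\[ I(\psi_1, \psi) = -\frac{\nu(\psi_1 - \psi)}{\psi_1} + \nu\log\left( \frac{\psi_1}{\psi} \right). \]

Substituting $\psi_1 = \hat\psi$, we have

\[ I(\hat\psi, \psi) = -\frac{\nu(\hat\psi - \psi)}{\hat\psi} + \nu\log\left( \frac{\hat\psi}{\psi} \right). \]

Thus,

\[ \frac{\partial}{\partial\psi} I(\hat\psi, \psi) = \frac{\nu}{\hat\psi} - \frac{\nu}{\psi}, \qquad \frac{\partial^2}{\partial\psi^2} I(\hat\psi, \psi) = \frac{\nu}{\psi^2} = I(\psi). \]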
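3.8.2 The log-likelihood function is

\[ l(\xi, \eta, \sigma_1, \sigma_2, \rho) = -\frac{1}{2} \log(1 - \rho^2) - \frac{1}{2} \log(\sigma_1^2) - \frac{1}{2} \log(\sigma_2^2) - \frac{1}{2(1 - \rho^2)} \left[ \left( \frac{X - \xi}{\sigma_1} \right)^2 + \left( \frac{Y - \eta}{\sigma_2} \right)^2 - 2\rho\, \frac{X - \xi}{\sigma_1} \cdot \frac{Y - \eta}{\sigma_2} \right]. \]

The score coefficients are

\[ S_1 = \frac{\partial l}{\partial\xi} = \frac{X - \xi}{\sigma_1^2(1 - \rho^2)} - \frac{\rho}{1 - \rho^2} \cdot \frac{Y - \eta}{\sigma_1\sigma_2}, \]
\[ S_2 = \frac{\partial l}{\partial\eta} = \frac{Y - \eta}{\sigma_2^2(1 - \rho^2)} - \frac{\rho}{1 - \rho^2} \cdot \frac{X - \xi}{\sigma_1\sigma_2}, \]
\[ S_3 = \frac{\partial l}{\partial\sigma_1^2} = \frac{(X - \xi)^2}{2\sigma_1^4(1 - \rho^2)} - \frac{\rho}{2(1 - \rho^2)} \cdot \frac{X - \xi}{\sigma_1^3} \cdot \frac{Y - \eta}{\sigma_2} - \frac{1}{2\sigma_1^2}, \]
\[ S_4 = \frac{\partial l}{\partial\sigma_2^2} = \frac{(Y - \eta)^2}{2\sigma_2^4(1 - \rho^2)} - \frac{\rho}{2(1 - \rho^2)} \cdot \frac{X - \xi}{\sigma_1} \cdot \frac{Y - \eta}{\sigma_2^3} - \frac{1}{2\sigma_2^2}, \]
\[ S_5 = \frac{\partial l}{\partial\rho} = \frac{\rho}{1 - \rho^2} - \frac{\rho}{(1 - \rho^2)^2} \left[ \left( \frac{X - \xi}{\sigma_1} \right)^2 + \left( \frac{Y - \eta}{\sigma_2} \right)^2 \right] + \frac{1 + \rho^2}{(1 - \rho^2)^2} \cdot \frac{X - \xi}{\sigma_1} \cdot \frac{Y - \eta}{\sigma_2}. \]

The FIM is $I = (I_{ij})$, $i, j = 1, \ldots, 5$, with

\[ I_{11} = V(S_1) = \frac{1}{\sigma_1^2(1 - \rho^2)}, \qquad I_{22} = V(S_2) = \frac{1}{\sigma_2^2(1 - \rho^2)}, \]
\[ I_{12} = \mathrm{cov}(S_1, S_2) = -\frac{\rho}{\sigma_1\sigma_2(1 - \rho^2)}, \]
\[ I_{33} = V(S_3) = \frac{2 - \rho^2}{4\sigma_1^4(1 - \rho^2)}, \qquad I_{44} = V(S_4) = \frac{2 - \rho^2}{4\sigma_2^4(1 - \rho^2)}, \]
\[ I_{13} = I_{14} = I_{15} = 0, \qquad I_{23} = I_{24} = I_{25} = 0, \]
\[ I_{34} = \mathrm{cov}(S_3, S_4) = -\frac{\rho^2}{4\sigma_1^2\sigma_2^2(1 - \rho^2)}, \]
\[ I_{35} = \mathrm{cov}(S_3, S_5) = -\frac{\rho}{2\sigma_1^2(1 - \rho^2)}, \qquad I_{45} = \mathrm{cov}(S_4, S_5) = -\frac{\rho}{2\sigma_2^2(1 - \rho^2)}, \]

and

\[ I_{55} = V(S_5) = \frac{1 + \rho^2}{(1 - \rho^2)^2}. \]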

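3.8.4 The FIM for the Weibull parameters. The likelihood function is

\[ L(\lambda, \alpha) = \lambda\alpha X^{\alpha-1} e^{-\lambda X^{\alpha}}. \]

Thus,

\[ l(\lambda, \alpha) = \log(\lambda) + \log(\alpha) + \log(X^{\alpha-1}) - \lambda X^{\alpha}. \]

The score functions are

\[
\begin{aligned}
S_1 &= \frac{\partial l}{\partial \lambda} = \frac{1}{\lambda} - X^{\alpha}, \\
S_2 &= \frac{\partial l}{\partial \alpha} = \frac{1}{\alpha} + \log(X) - \lambda X^{\alpha}\log(X)
     = \frac{1}{\alpha} + \frac{1}{\alpha}\log(X^{\alpha}) - \frac{\lambda}{\alpha}X^{\alpha}\log(X^{\alpha}).
\end{aligned}
\]

Recall that $X^{\alpha} \sim E(\lambda)$. Thus, $E\{X^{\alpha}\} = 1/\lambda$ and $E\{S_1\} = 0$. Let $\psi(1)$ denote the di-gamma function at 1 (see Abramowitz and Stegun, 1965, p. 259). Then,

\[
\begin{aligned}
E\{\log(X^{\alpha})\} &= -\log(\lambda) + \psi(1), \\
E\{X^{\alpha}\log X^{\alpha}\} &= \frac{1}{\lambda}\left(1 - \log(\lambda) + \psi(1)\right).
\end{aligned}
\]

Thus,

\[ E\{S_2\} = \frac{1}{\alpha} + \frac{1}{\alpha}\left(\psi(1) - \log(\lambda)\right) - \frac{\lambda}{\alpha}\cdot\frac{1}{\lambda}\left(1 + \psi(1) - \log(\lambda)\right) = 0. \]

Furthermore,

\[ I_{11} = V(S_1) = V\{X^{\alpha}\} = \frac{1}{\lambda^2}, \]

\[
\begin{aligned}
I_{12} = \mathrm{cov}(S_1, S_2)
 &= \mathrm{cov}\left(-X^{\alpha},\ \frac{1}{\alpha}\log X^{\alpha} - \frac{\lambda}{\alpha}X^{\alpha}\log X^{\alpha}\right) \\
 &= -\frac{1}{\alpha}\,\mathrm{cov}(Y, \log Y) + \frac{\lambda}{\alpha}\,\mathrm{cov}(Y, Y\log Y),
\end{aligned}
\]

where $Y = X^{\alpha} \sim E(\lambda)$. The required moments are

\[
\begin{aligned}
E\{\log Y\} &= \psi(1) - \log\lambda, \\
E\{(\log Y)^2\} &= \psi'(1) + (\psi(1) - \log\lambda)^2, \\
E\{Y\log Y\} &= \frac{1}{\lambda}\left(1 + \psi(1) - \log\lambda\right), \\
E\{Y(\log Y)^2\} &= \frac{1}{\lambda}\left(\psi'(1) + 2(\psi(1) - \log\lambda) + (\psi(1) - \log\lambda)^2\right), \\
E\{Y^2\log Y\} &= \frac{1}{\lambda^2}\left(1 + 2(1 + \psi(1) - \log\lambda)\right), \\
E\{Y^2(\log Y)^2\} &= \frac{2}{\lambda^2}\left(\psi'(1) + (1 + \psi(1) - \log\lambda)^2 + \psi(1) - \log\lambda\right).
\end{aligned}
\]

Accordingly,

\[ \mathrm{cov}(Y, \log Y) = \frac{1}{\lambda}, \qquad \mathrm{cov}(Y, Y\log Y) = \frac{2 + \psi(1) - \log\lambda}{\lambda^2}, \]

and

\[ I_{12} = \frac{1}{\lambda\alpha}\left(1 + \psi(1) - \log\lambda\right). \]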
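Finally,

\[
\begin{aligned}
I_{22} = V(S_2)
 &= V\left\{\frac{1}{\alpha}\log Y - \frac{\lambda}{\alpha}Y\log Y\right\} \\
 &= \frac{1}{\alpha^2}V\{\log Y\} + \frac{\lambda^2}{\alpha^2}V\{Y\log Y\} - \frac{2\lambda}{\alpha^2}\,\mathrm{cov}(\log Y, Y\log Y) \\
 &= \frac{1}{\alpha^2}\Big(\psi'(1) + 2(\psi'(1) + \psi(1) - \log\lambda) + (1 + \psi(1) - \log\lambda)^2 - 2(\psi'(1) + \psi(1) - \log\lambda)\Big).
\end{aligned}
\]

Thus,

\[ I_{22} = \frac{1}{\alpha^2}\left(\psi'(1) + (1 + \psi(1) - \log\lambda)^2\right). \]

The Fisher Information Matrix is

\[
I(\lambda, \alpha) =
\begin{pmatrix}
\dfrac{1}{\lambda^2} & \dfrac{1}{\lambda\alpha}\left(1 + \psi(1) - \log\lambda\right) \\[2ex]
\cdot & \dfrac{1}{\alpha^2}\left(\psi'(1) + (1 + \psi(1) - \log\lambda)^2\right)
\end{pmatrix}.
\]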
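As a numerical sanity check on the matrix above, the following Python sketch compares the closed-form entries with Monte Carlo averages of the products $S_iS_j$. It is only an illustration: the parameter values $\lambda = 2$, $\alpha = 1.5$, the sample size, and the seed are arbitrary choices, and the constants $\psi(1) = -0.5772\ldots$ (minus Euler's constant) and $\psi'(1) = \pi^2/6$ are standard values that the text does not state explicitly.

```python
import math
import random

PSI_1 = -0.5772156649015329        # di-gamma function at 1 (minus Euler's constant)
PSI_PRIME_1 = math.pi ** 2 / 6     # derivative of the di-gamma function at 1


def weibull_fim(lam, alpha):
    """Closed-form entries (I11, I12, I22) of the FIM derived above."""
    a = 1.0 + PSI_1 - math.log(lam)
    return (1.0 / lam ** 2, a / (lam * alpha), (PSI_PRIME_1 + a ** 2) / alpha ** 2)


def weibull_fim_monte_carlo(lam, alpha, n=200_000, seed=1):
    """Monte Carlo estimates of E{S1^2}, E{S1*S2}, E{S2^2}."""
    rng = random.Random(seed)
    s11 = s12 = s22 = 0.0
    for _ in range(n):
        y = rng.expovariate(lam)           # Y = X^alpha ~ E(lam), mean 1/lam
        log_y = math.log(y)
        s1 = 1.0 / lam - y                 # score with respect to lambda
        s2 = (1.0 + log_y - lam * y * log_y) / alpha   # score with respect to alpha
        s11 += s1 * s1
        s12 += s1 * s2
        s22 += s2 * s2
    return (s11 / n, s12 / n, s22 / n)


if __name__ == "__main__":
    print("exact      :", weibull_fim(2.0, 1.5))
    print("monte carlo:", weibull_fim_monte_carlo(2.0, 1.5))
```

The two printed triples should agree up to simulation error; since the scores have mean zero, the averaged products estimate the variances and the covariance directly.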


CHAPTER 4 


Testing Statistical Hypotheses 


PART I: THEORY 
4.1 THE GENERAL FRAMEWORK 


Statistical hypotheses are statements about the unknown characteristics of the distri- 
butions of observed random variables. The first step in testing statistical hypotheses 
is to formulate a statistical model that can represent the empirical phenomenon being 
studied and identify the subfamily of distributions corresponding to the hypothesis 
under consideration. The statistical model specifies the family of distributions rele- 
vant to the problem. Classical tests of significance, of the type that will be presented 
in the following sections, test whether the deviations of observed sample statistics 
from the values of the corresponding parameters, as specified by the hypotheses, 
cannot be ascribed just to randomness. Significant deviations lead to weakening of 
the hypotheses or to their rejection. This testing of the significance of deviations is 
generally done by constructing a test statistic based on the sample values, deriving 
the sampling distribution of the test statistic according to the model and the values 
of the parameters specified by the hypothesis, and rejecting the hypothesis if the 
observed value of the test statistic lies in an improbable region under the hypothesis. 
For example, if deviations from the hypothesis lead to large values of a nonnegative 
test statistic T(X), we compute the probability that future samples of the type drawn 
will yield values of $T(X)$ at least as large as the presently observed one. Thus, if we
observe the value $t_0$ of $T(X)$, we compute the tail probability

\[ \alpha(t_0) = P_0\{T(X) \geq t_0\}. \]

This value is called the observed significance level or the P-value of the test. A
small value of the observed significance level means either that an improbable event
has occurred or that the sample data are incompatible with the hypothesis being
tested. If $\alpha(t_0)$ is very small, it is customary to reject the hypothesis.
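The tail probability $\alpha(t_0)$ can be computed directly in simple discrete models. The sketch below assumes, purely for illustration, that $T(X)$ is the number of successes in $n = 20$ Bernoulli trials, that the hypothesis specifies success probability $0.5$, and that $t_0 = 15$ was observed; none of these numbers come from the text.

```python
from math import comb


def binomial_tail(n, p, t0):
    """Tail probability P{T >= t0} for T ~ B(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(t0, n + 1))


# Observed significance level alpha(t0) for the illustrative values above.
p_value = binomial_tail(20, 0.5, 15)
print(round(p_value, 4))  # 21700 / 2**20, approximately 0.0207
```

A value of about 0.02 would usually be regarded as small, so under these assumptions the hypothesis would customarily be rejected.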

One of the theoretical difficulties with this testing approach is that it does not 
provide a framework for choosing the test statistic. Generally, our intuition and 
knowledge of the problem will yield a reasonable test statistic. However, the
formulation of one hypothesis is insufficient for answering the question of whether the
proposed test is a good one and how large the sample should be. In order to
construct an optimal test, in a sense that will be discussed later, we have to formulate an
alternative hypothesis, against the hypothesis under consideration. For distinguishing
between the hypothesis and its alternative (which is also a hypothesis), we call the
first one a null hypothesis (denoted by $H_0$) and the other one an alternative
hypothesis $H_1$. The alternative hypothesis can also be formulated in terms of a subfamily
of distributions according to the specified model. We denote this subfamily by $\mathcal{F}_1$,
and the subfamily corresponding to the null hypothesis by $\mathcal{F}_0$. If
the family $\mathcal{F}_0$ or $\mathcal{F}_1$ contains only one element, the corresponding null or alternative
hypothesis is called simple; otherwise it is called composite. The null hypothesis
and the alternative one enable us to determine not only the optimal test, but also 
the sample size required to obtain a test having a certain strength. We distinguish 
between two kinds of errors. An error of Type I is the error due to rejection of the 
null hypothesis when it is true. An error of Type II is the one committed when the 
null hypothesis is not rejected when it is false. It is generally impossible to guarantee 
that a test will never commit either one of the two kinds of errors. A trivial test 
that always accepts the null hypothesis never commits an error of the first kind but 
commits an error of the second kind whenever the alternative hypothesis is true. Such 
a test is powerless. The theoretical framework developed here measures the risk in 
these two kinds of errors by the probabilities that a certain test will commit these 
errors. Ideally, the probabilities of the two kinds of errors should be kept low. This 
can be done by choosing the proper test and by observing a sufficiently large sample. 
In order to further develop these ideas we introduce now the notion of a test function. 

Let $X = (X_1, \ldots, X_n)$ be a vector of random variables observable for the purpose
of testing the hypothesis $H_0$ against $H_1$. A function $\phi(X)$ that assumes values in the
interval $[0, 1]$ and is a sample statistic is called a test function. Using a test function
$\phi(X)$ and observing $X = x$, the null hypothesis $H_0$ is rejected with probability $\phi(x)$.
This is actually a conditional probability of rejecting $H_0$, given $\{X = x\}$. For a
given value of $\phi(x)$, we draw a value $R$ from a table of random numbers, having
a rectangular distribution $R(0, 1)$, and reject $H_0$ if $R \leq \phi(x)$. Such a procedure is
called a randomized test. If $\phi(x)$ is either 0 or 1, for all $x$, we call the procedure a
nonrandomized test. The set of $x$ values in the sample space $\mathcal{X}$ for which $\phi(x) = 1$
is called the rejection region corresponding to $\phi(x)$.
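The randomization step described above can be sketched in a few lines of Python. The function names are hypothetical, and a pseudo-random generator stands in for the table of random numbers. The comparison is written with a strict inequality, which has the same rejection probability because $R$ is continuous.

```python
import random


def reject_h0(phi_x, rng):
    """Return True (reject H0) with probability phi_x, using R ~ R(0, 1)."""
    if not 0.0 <= phi_x <= 1.0:
        raise ValueError("a test function takes values in [0, 1]")
    # rng.random() lies in [0, 1), so phi_x = 0 never rejects and phi_x = 1 always does.
    return rng.random() < phi_x


def rejection_frequency(phi_x, draws=100_000, seed=0):
    """Relative frequency of rejection over repeated draws of R."""
    rng = random.Random(seed)
    return sum(reject_h0(phi_x, rng) for _ in range(draws)) / draws


if __name__ == "__main__":
    for value in (0.0, 0.3, 1.0):
        print(value, rejection_frequency(value))
```

For $\phi(x) = 0$ or $1$ the decision is deterministic, which is the nonrandomized case; for intermediate values the long-run rejection frequency approaches $\phi(x)$.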

We distinguish between test functions according to their size and power. The
size of a test function $\phi(x)$ is the maximal probability of error of the first kind, over
all the distribution functions $F$ in $\mathcal{F}_0$, i.e., $\alpha = \sup\{E\{\phi(X) \mid F\} : F \in \mathcal{F}_0\}$, where
$E\{\phi(X) \mid F\}$ denotes the expected value of $\phi(X)$ (the total probability of rejecting
$H_0$) under the distribution $F$. We denote the size of the test by $\alpha$. The power of
a test is the probability of rejecting $H_0$ when the parent distribution $F$ belongs to
$\mathcal{F}_1$. As we vary $F$ over $\mathcal{F}_1$, we can consider the power of a test as a functional
$\psi(F; \phi)$ over $\mathcal{F}_1$. In parametric cases, where each $F$ can be represented by a real
or vector valued parameter $\theta$, we speak about a power function $\psi(\theta; \phi)$, $\theta \in \Theta_1$,
where $\Theta_1$ is the set of all parameter points corresponding to $\mathcal{F}_1$. A test function
$\phi^0(x)$ that maximizes the power, with respect to all test functions $\phi(x)$ having the
same size, at every point $\theta$, is called uniformly most powerful (UMP) of size $\alpha$.
Such a test function is optimal. As will be shown, uniformly most powerful tests
exist only in special situations. Generally we need to seek tests with some other
good properties. Notice that if the model specifies a family of distributions $\mathcal{F}$ that
admits a (nontrivial) sufficient statistic, $T(X)$, then for any specified test function,
$\phi(X)$ say, the test function $\hat\phi(T) = E\{\phi(X) \mid T\}$ is equivalent, in the sense that it has
the same size and the same power function. Thus, one can restrict attention only to
test functions that depend on minimal sufficient statistics.

The literature on testing statistical hypotheses is so rich that there is no point in
trying to list here even the important papers. Expositions of the basic theory at
various levels of sophistication can be found in almost all the textbooks available on
Probability and Mathematical Statistics. For an introduction to the asymptotic (large
sample) theory of testing hypotheses, see Cox and Hinkley (1974). More sophisti- 
cated discussion of the theory is given in Chapter III of Schmetterer (1974). In the 
following sections we present an exposition of important techniques. A comprehen- 
sive treatment of the theory of optimal tests is given in Lehmann (1997). 


4.2 THE NEYMAN-PEARSON FUNDAMENTAL LEMMA 


In this section we develop the most powerful test of two simple hypotheses. Thus,
let $\mathcal{F} = \{F_0, F_1\}$ be a family of two specified distribution functions. Let $f_0(x)$ and
$f_1(x)$ be the probability density functions (p.d.f.s) corresponding to the elements
of $\mathcal{F}$. The null hypothesis $H_0$ is that the parent distribution is $F_0$. The alternative
hypothesis $H_1$ is that the parent distribution is $F_1$. We exclude the problem of testing
$H_0$ at size $\alpha = 0$ since this is obtained by the trivial test function that accepts $H_0$ with
probability one (according to $F_0$). The following lemma, which is the basic result of
the whole theory, was given by Neyman and Pearson (1933).


Theorem 4.2.1. (The Neyman-Pearson Lemma) For testing $H_0$ against $H_1$,
(a) Any test function of the form

\[
\phi^0(X) =
\begin{cases}
1, & \text{if } f_1(X) > k f_0(X) \\
\gamma, & \text{if } f_1(X) = k f_0(X) \\
0, & \text{otherwise}
\end{cases}
\tag{4.2.1}
\]

for some $0 \leq k < \infty$ and $0 \leq \gamma \leq 1$ is most powerful relative to all tests of its size.

(b) (Existence) For testing $H_0$ against $H_1$, at a level of significance $\alpha$ there exist
constants $k_\alpha$, $0 \leq k_\alpha < \infty$, and $\gamma_\alpha$, $0 \leq \gamma_\alpha \leq 1$, such that the corresponding test
function of the form (4.2.1) is most powerful of size $\alpha$.

(c) (Uniqueness) If a test $\phi^1$ is most powerful of size $\alpha$, then it is of the form (4.2.1),
except perhaps on the set $\{x; f_1(x) = k f_0(x)\}$, unless there exists a test of size smaller
than $\alpha$ and power 1.


Proof. (a) Let $\alpha$ be the size of the test function $\phi^0(X)$ given by (4.2.1). Let $\phi^1(x)$ be
any other test function whose size does not exceed $\alpha$, i.e.,

\[ E_0\{\phi^1(X)\} \leq \alpha. \tag{4.2.2} \]

The expectation in (4.2.2) is with respect to the distribution $F_0$. We show now that
the power of $\phi^1(X)$ cannot exceed that of $\phi^0(X)$. Define the sets

\[
\begin{aligned}
R^- &= \{x; f_1(x) < k f_0(x)\}, \\
R^0 &= \{x; f_1(x) = k f_0(x)\}, \\
R^+ &= \{x; f_1(x) > k f_0(x)\}.
\end{aligned}
\tag{4.2.3}
\]

We notice that $\{R^-, R^0, R^+\}$ is a partition of $\mathcal{X}$. We prove now that

\[ \int_{-\infty}^{\infty} (\phi^1(x) - \phi^0(x)) f_1(x)\, d\mu(x) \leq 0. \tag{4.2.4} \]

Indeed,

\[
\int (\phi^1(x) - \phi^0(x))(f_1(x) - k f_0(x))\, d\mu(x)
= \left(\int_{R^-} + \int_{R^0} + \int_{R^+}\right)(\phi^1(x) - \phi^0(x))(f_1(x) - k f_0(x))\, d\mu(x).
\tag{4.2.5}
\]

Moreover, since on $R^-$ the inequality $f_1(x) - k f_0(x) < 0$ is satisfied and $\phi^0(x) = 0$,
we have

\[ \int_{R^-} (\phi^1(x) - \phi^0(x))(f_1(x) - k f_0(x))\, d\mu(x) \leq 0. \tag{4.2.6} \]

Similarly,

\[ \int_{R^0} (\phi^1(x) - \phi^0(x))(f_1(x) - k f_0(x))\, d\mu(x) = 0 \tag{4.2.7} \]

and since on $R^+$ $\phi^0(x) = 1$,

\[ \int_{R^+} (\phi^1(x) - \phi^0(x))(f_1(x) - k f_0(x))\, d\mu(x) \leq 0. \tag{4.2.8} \]

Hence, from (4.2.6)-(4.2.8) we obtain

\[ \int (\phi^1(x) - \phi^0(x)) f_1(x)\, d\mu(x) \leq k \int (\phi^1(x) - \phi^0(x)) f_0(x)\, d\mu(x) \leq 0. \tag{4.2.9} \]

The inequality on the RHS of (4.2.9) follows from the assumption that the size of
$\phi^0(x)$ is exactly $\alpha$ and that of $\phi^1(x)$ does not exceed $\alpha$. Hence, from (4.2.9),

\[ \int \phi^1(x) f_1(x)\, d\mu(x) \leq \int \phi^0(x) f_1(x)\, d\mu(x). \tag{4.2.10} \]

This proves (a).
(b) (Existence). Consider the distribution $W(\xi)$ of the random variable
$f_1(X)/f_0(X)$, which is induced by the distribution $F_0$, i.e.,

\[ W(\xi) = P_0\left\{\frac{f_1(X)}{f_0(X)} \leq \xi\right\}. \tag{4.2.11} \]

We notice that $P_0\{f_0(X) = 0\} = 0$. Accordingly $W(\xi)$ is a c.d.f. The $\gamma$-quantile of
$W(\xi)$ is defined as

\[ W^{-1}(\gamma) = \inf\{\xi; W(\xi) \geq \gamma\}. \tag{4.2.12} \]

For a given value of $\alpha$, $0 < \alpha < 1$, we should determine $0 \leq k_\alpha < \infty$ and $0 \leq \gamma_\alpha \leq 1$
so that, according to (4.2.1),

\[ \alpha = E_0\{\phi^0(X)\} = 1 - W(k_\alpha) + \gamma_\alpha[W(k_\alpha) - W(k_\alpha - 0)], \tag{4.2.13} \]

where $W(k_\alpha) - W(k_\alpha - 0)$ is the height of the jump of $W(\xi)$ at $k_\alpha$. Thus, let

\[ k_\alpha = W^{-1}(1 - \alpha). \tag{4.2.14} \]

Obviously, $0 \leq k_\alpha < \infty$, since $W(\xi)$ is a c.d.f. of a nonnegative random variable.
Notice that, for a given $0 < \alpha < 1$, $k_\alpha = 0$ whenever

\[ P_0\left\{\frac{f_1(X)}{f_0(X)} = 0\right\} \geq 1 - \alpha. \]

If $W(k_\alpha) - W(k_\alpha - 0) = 0$ then define $\gamma_\alpha = 0$. Otherwise, let $\gamma_\alpha$ be the unique
solution of (4.2.13), i.e.,

\[ \gamma_\alpha = \frac{W(k_\alpha) - (1 - \alpha)}{W(k_\alpha) - W(k_\alpha - 0)}. \tag{4.2.15} \]

Obviously, $0 \leq \gamma_\alpha \leq 1$.
(c) (Uniqueness). For a given $\alpha$, let $\phi^0(X)$ be a test function of the form (4.2.1)
with $k_\alpha$ and $\gamma_\alpha$ as in (4.2.14)-(4.2.15). Suppose that $\phi^1(X)$ is a most powerful test
function of size $\alpha$. From (4.2.9), we have

\[ \int (\phi^0(x) - \phi^1(x))(f_1(x) - k_\alpha f_0(x))\, d\mu(x) \geq 0. \tag{4.2.16} \]

But,

\[ \int \phi^1(x) f_0(x)\, d\mu(x) = \int \phi^0(x) f_0(x)\, d\mu(x) \tag{4.2.17} \]

and since $\phi^0$ is most powerful,

\[ \int \phi^0(x) f_1(x)\, d\mu(x) = \int \phi^1(x) f_1(x)\, d\mu(x). \]

Hence, the LHS of (4.2.16) is equal to zero. Moreover, the integrand on the LHS of (4.2.16) is
nonnegative. Therefore, it must be zero for all $x$ except perhaps on the union of $R^0$ and
a set $N$ of probability zero. It follows that on $(R^+ - N) \cup (R^- - N)$, $\phi^0(x) = \phi^1(x)$.
On the other hand, if $\phi^1(x)$ has size less than $\alpha$ and power 1, then the above argument
is invalid. QED
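The construction in part (b) of the proof can be followed step by step in a discrete case. The sketch below is only an illustration under assumptions not in the text: $F_0$ and $F_1$ are taken to be the binomial distributions $B(10, 0.5)$ and $B(10, 0.7)$, and the function names are hypothetical. It computes $k_\alpha$ from (4.2.14) and $\gamma_\alpha$ from (4.2.15), and then evaluates the rejection probability of the test (4.2.1).

```python
from math import comb


def binom_pmf(n, p, x):
    """P{X = x} for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def np_constants(n, p0, p1, alpha):
    """Return (k_alpha, gamma_alpha) of (4.2.14)-(4.2.15) for B(n, p0) against B(n, p1)."""
    # Support points of f1(X)/f0(X) with their F0-probabilities, sorted by the ratio.
    # For p0 != p1 the ratio is strictly monotone in x, so the points are distinct.
    points = sorted(
        (binom_pmf(n, p1, x) / binom_pmf(n, p0, x), binom_pmf(n, p0, x))
        for x in range(n + 1)
    )
    w = 0.0
    for ratio, prob in points:
        w_left = w            # W(k - 0)
        w += prob             # W(k)
        if w >= 1 - alpha:    # first point with W >= 1 - alpha: k = W^{-1}(1 - alpha)
            return ratio, (w - (1 - alpha)) / (w - w_left)
    raise ValueError("alpha must lie in (0, 1)")


def rejection_probability(n, p, p0, p1, k, gamma):
    """E{phi0(X)} under B(n, p) for the test function (4.2.1)."""
    total = 0.0
    for x in range(n + 1):
        ratio = binom_pmf(n, p1, x) / binom_pmf(n, p0, x)
        if ratio > k:
            total += binom_pmf(n, p, x)
        elif ratio == k:
            total += gamma * binom_pmf(n, p, x)
    return total


if __name__ == "__main__":
    k, gamma = np_constants(10, 0.5, 0.7, 0.05)
    print("k =", k, "gamma =", gamma)
    print("size  =", rejection_probability(10, 0.5, 0.5, 0.7, k, gamma))
    print("power =", rejection_probability(10, 0.7, 0.5, 0.7, k, gamma))
```

Because the binomial distribution is discrete, $\gamma_\alpha$ is strictly between 0 and 1 here, and the randomization on the set $\{f_1(x) = k_\alpha f_0(x)\}$ is what makes the size exactly $\alpha$.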


An extension of the Neyman-Pearson Fundamental Lemma to cases of testing $m$
hypotheses $H_1, \ldots, H_m$ against an alternative $H_{m+1}$ was provided by Chernoff and
Scheffé (1952). This generalization provides a most powerful test of $H_{m+1}$ under the
constraint that the Type I error probabilities of $H_1, \ldots, H_m$ do not exceed $\alpha_1, \ldots, \alpha_m$,
correspondingly, where $0 < \alpha_i < 1$, $i = 1, \ldots, m$. See also Dantzig and Wald (1951).


4.3 TESTING ONE-SIDED COMPOSITE HYPOTHESES
IN MLR MODELS 


In this section we show that the most powerful tests, which are derived according to 
the Neyman—Pearson Lemma, can be uniformly most powerful for testing composite 
hypotheses in certain models. In the following example we illustrate such a case. 

A family of distributions $\mathcal{F} = \{F(x; \theta), \theta \in \Theta\}$, where $\Theta$ is an interval on the
real line, is said to have the monotone likelihood ratio property (MLR) if, for every
$\theta_1 < \theta_2$ in $\Theta$, the likelihood ratio

\[ f(x; \theta_2)/f(x; \theta_1) \]

is a nondecreasing function of $x$. We also say that $\mathcal{F}$ is an MLR family with respect
to $X$. For example, consider the one-parameter exponential type family with p.d.f.

\[ f(x; \theta) = h(x)\exp\{\theta U(x) - K(\theta)\}, \quad -\infty < \theta < \infty. \]

This family is MLR with respect to $U(X)$.
The following important lemma was proven by Karlin (1956).

Theorem 4.3.1 (Karlin's Lemma). Suppose that $\mathcal{F} = \{F(x; \theta); -\infty < \theta < \infty\}$
is an MLR family w.r.t. $x$. If $g(x)$ is a nondecreasing function of $x$, then $E_\theta\{g(X)\}$
is a nondecreasing function of $\theta$. Furthermore, for any $\theta < \theta'$, $F(x; \theta) \geq F(x; \theta')$
for all $x$.


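Proof. (i) Consider two points $\theta$, $\theta'$ such that $\theta < \theta'$. Define the sets

\[
\begin{aligned}
A &= \{x; f(x; \theta') < f(x; \theta)\}, \\
B &= \{x; f(x; \theta') > f(x; \theta)\},
\end{aligned}
\tag{4.3.1}
\]

where $f(x; \theta)$ are the corresponding p.d.f.s. Since $f(x; \theta')/f(x; \theta)$ is a nondecreasing
function of $x$, if $x \in A$ and $x' \in B$ then $x < x'$. Therefore,

\[ a = \sup_{x \in A} g(x) \leq \inf_{x \in B} g(x) = b. \tag{4.3.2} \]

We wish to show that $E_{\theta'}\{g(X)\} \geq E_\theta\{g(X)\}$. Consider

\[
\int g(x)[f(x; \theta') - f(x; \theta)]\, d\mu(x)
= \int_A g(x)[f(x; \theta') - f(x; \theta)]\, d\mu(x) + \int_B g(x)[f(x; \theta') - f(x; \theta)]\, d\mu(x).
\tag{4.3.3}
\]

Furthermore, since on the set $A$ $f(x; \theta') - f(x; \theta) < 0$, we have

\[ \int_A g(x)[f(x; \theta') - f(x; \theta)]\, d\mu(x) \geq a \int_A [f(x; \theta') - f(x; \theta)]\, d\mu(x). \tag{4.3.4} \]

Hence, applying the analogous bound with $b$ on the set $B$,

\[
\int g(x)[f(x; \theta') - f(x; \theta)]\, d\mu(x)
\geq a \int_A [f(x; \theta') - f(x; \theta)]\, d\mu(x) + b \int_B [f(x; \theta') - f(x; \theta)]\, d\mu(x).
\tag{4.3.5}
\]

Moreover, for each of the two parameter points $\tilde\theta \in \{\theta, \theta'\}$,

\[ \int_A f(x; \tilde\theta)\, d\mu(x) + \int_B f(x; \tilde\theta)\, d\mu(x) = 1 - P_{\tilde\theta}\{f(X; \theta') = f(X; \theta)\}. \]

In particular,

\[
\begin{aligned}
\int_A f(x; \theta')\, d\mu(x) &= -\int_B f(x; \theta')\, d\mu(x) + 1 - P_{\theta'}\{f(X; \theta') = f(X; \theta)\}, \\
-\int_A f(x; \theta)\, d\mu(x) &= \int_B f(x; \theta)\, d\mu(x) - 1 + P_{\theta}\{f(X; \theta') = f(X; \theta)\}.
\end{aligned}
\tag{4.3.6}
\]

Since the two densities coincide on the set $\{x; f(x; \theta') = f(x; \theta)\}$, the two probabilities in (4.3.6) are equal. This implies that

\[ \int_A [f(x; \theta') - f(x; \theta)]\, d\mu(x) = -\int_B [f(x; \theta') - f(x; \theta)]\, d\mu(x). \tag{4.3.7} \]

Moreover, from (4.3.5) and (4.3.7), we obtain that

\[ E_{\theta'}\{g(X)\} - E_\theta\{g(X)\} \geq (b - a)\int_B [f(x; \theta') - f(x; \theta)]\, d\mu(x) \geq 0. \tag{4.3.8} \]

Indeed, from (4.3.2), $(b - a) \geq 0$ and according to the definition of $B$,
$\int_B [f(x; \theta') - f(x; \theta)]\, d\mu(x) \geq 0$. This completes the proof of part (i).

(ii) For any given $x$, define $\phi_x(y) = I\{y; y > x\}$. $\phi_x(y)$ is a nondecreasing function
of $y$. According to part (i), if $\theta' > \theta$ then $E_\theta\{\phi_x(Y)\} \leq E_{\theta'}\{\phi_x(Y)\}$. We notice that
$E_\theta\{\phi_x(Y)\} = P_\theta\{Y > x\} = 1 - F(x; \theta)$. Thus, if $\theta < \theta'$ then $F(x; \theta) \geq F(x; \theta')$ for
all $x$. QED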
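Both conclusions of Karlin's Lemma can be checked numerically on a concrete MLR family. The sketch below uses the binomial family $B(10, p)$, which is of the one-parameter exponential type with $U(x) = x$ and natural parameter $\log(p/(1-p))$, so it is MLR in $x$. The family, the grid of $p$ values, and the nondecreasing function $g(x) = \min(x, 4)^2$ are illustrative choices, not taken from the text.

```python
from math import comb


def binom_pmf(n, p, x):
    """P{X = x} for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_cdf(n, p, x):
    """F(x; p) = P{X <= x} for X ~ B(n, p)."""
    return sum(binom_pmf(n, p, j) for j in range(x + 1))


def expected_value(g, n, p):
    """E_p{g(X)} for X ~ B(n, p)."""
    return sum(g(x) * binom_pmf(n, p, x) for x in range(n + 1))


def g(x):
    """A nondecreasing function of x (illustrative)."""
    return min(x, 4) ** 2


N = 10
GRID = [i / 10 for i in range(1, 10)]      # p = 0.1, ..., 0.9 (increasing in theta)
means = [expected_value(g, N, p) for p in GRID]

if __name__ == "__main__":
    print([round(m, 3) for m in means])    # should be nondecreasing in p
```

The list `means` is nondecreasing along the grid, as the first part of the lemma asserts; the second part corresponds to `binom_cdf(N, p, x)` being nonincreasing in `p` for every fixed `x`.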


Theorem 4.3.2. If a one-parameter family $\mathcal{F} = \{F_\theta(x); -\infty < \theta < \infty\}$ admits a
sufficient statistic $T(X)$ and if the corresponding family of distributions of $T(X)$, $\mathcal{F}^T$,
is MLR with respect to $T(X)$, then the test function

\[
\phi^0(T(X)) =
\begin{cases}
1, & \text{if } T(X) > k_\alpha \\
\gamma_\alpha, & \text{if } T(X) = k_\alpha \\
0, & \text{otherwise}
\end{cases}
\tag{4.3.9}
\]

has the following properties.

(i) It is UMP of its size for testing $H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$, where $-\infty <
\theta_0 < \infty$, provided the size of the test is not zero.

(ii) For every $\alpha$, $0 < \alpha < 1$, there exist constants $k_\alpha$, $\gamma_\alpha$; $-\infty < k_\alpha < \infty$, $0 \leq
\gamma_\alpha \leq 1$, for which the corresponding test function $\phi^0(T(X))$ is UMP of size $\alpha$.

(iii) The power function of $\phi^0(T(X))$ is nondecreasing in $\theta$.

Proof. For simplicity of notation we let $T(x) = x$ (real).
(i) From the Neyman-Pearson Lemma, a most powerful test of $H_0^*: \theta = \theta_0$ against
$H_1^*: \theta = \theta_1$, $\theta_1 > \theta_0$, is of the form

\[
\phi^0(X) =
\begin{cases}
1, & \text{if } \dfrac{f(X; \theta_1)}{f(X; \theta_0)} > k \\[2ex]
\gamma, & \text{if } \dfrac{f(X; \theta_1)}{f(X; \theta_0)} = k \\[2ex]
0, & \text{otherwise}
\end{cases}
\tag{4.3.10}
\]

provided $0 \leq k < \infty$. Hence, since $\mathcal{F}$ is an MLR family w.r.t. $X$, $f(X; \theta_1)/f(X; \theta_0) > k$
implies that $X > x_0$. $x_0$ is determined from the equation $f(x_0; \theta_1)/f(x_0; \theta_0) = k$.
Thus, (4.3.9) is also most powerful for testing $H_0^*$ against $H_1^*$ at the same size as
(4.3.10). The constants $x_0$ and $\gamma$ are determined so that (4.3.9) and (4.3.10) will have
the same size. Thus, if $\alpha$ is the size of (4.3.10) then $x_0$ and $\gamma$ should satisfy the
equation

\[ P_{\theta_0}\{X > x_0\} + \gamma P_{\theta_0}\{X = x_0\} = \alpha. \tag{4.3.11} \]

Hence, $x_0$ and $\gamma$ may depend only on $\theta_0$, but are independent of $\theta_1$. Therefore,
the test function $\phi^0(X)$ given by (4.3.9) is uniformly most powerful for testing $H_0^*$
against $H_1$. Moreover, since $\phi^0(X)$ is a nondecreasing function of $X$, the size of the
test $\phi^0$ (for testing $H_0$ against $H_1$) is $\alpha$. Indeed, from Karlin's Lemma the power
function $\psi(\theta; \phi^0) = E_\theta\{\phi^0(X)\}$ is a nondecreasing function of $\theta$ (which proves (iii)).
Hence, $\sup_{\theta \leq \theta_0} E_\theta\{\phi^0(X)\} = \alpha$. Thus, $\phi^0(X)$ is uniformly most powerful for testing $H_0$
against $H_1$.
(ii) The proof of this part is simple. Given any $\alpha$, $0 < \alpha < 1$, we set $x_0 = F^{-1}(1 -
\alpha; \theta_0)$, where $F^{-1}(\gamma; \theta)$ denotes the $\gamma$-quantile of $F(x; \theta)$. If $F(x; \theta_0)$ is continuous
at $x_0$, we set $\gamma = 0$, otherwise

\[ \gamma = \frac{F(x_0; \theta_0) - (1 - \alpha)}{F(x_0; \theta_0) - F(x_0 - 0; \theta_0)}. \tag{4.3.12} \]

QED
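The constants in part (ii) of the proof are easy to compute for a discrete MLR family. The sketch below assumes, as an illustration not taken from the text, a single Poisson observation with mean $\lambda$ (an exponential type family, MLR in $x$) and the hypotheses $H_0: \lambda \leq \lambda_0$ against $H_1: \lambda > \lambda_0$ with $\lambda_0 = 2$. It determines $x_0$ and $\gamma$ from (4.3.11)-(4.3.12) and evaluates the power function.

```python
from math import exp


def ump_constants(lam0, alpha):
    """Return (x0, gamma): x0 is the (1 - alpha)-quantile of P(lam0), gamma as in (4.3.12)."""
    x, pmf = 0, exp(-lam0)
    cdf = pmf
    while cdf < 1 - alpha:
        x += 1
        pmf *= lam0 / x       # P{X = x} from P{X = x - 1}
        cdf += pmf
    return x, (cdf - (1 - alpha)) / pmf


def power(lam, x0, gamma):
    """psi(lam) = P{X > x0} + gamma * P{X = x0} for X ~ P(lam)."""
    pmf = exp(-lam)
    cdf = pmf
    for j in range(1, x0 + 1):
        pmf *= lam / j
        cdf += pmf
    # After the loop, pmf = P{X = x0} and cdf = P{X <= x0}.
    return 1.0 - cdf + gamma * pmf


if __name__ == "__main__":
    x0, gamma = ump_constants(2.0, 0.05)
    print("x0 =", x0, "gamma =", round(gamma, 4))
    for lam in (1.0, 2.0, 3.0, 5.0):
        print(lam, round(power(lam, x0, gamma), 4))
```

The power equals $\alpha$ at $\lambda_0$ and increases with $\lambda$, in line with property (iii) of the theorem, so the supremum of the rejection probability over $\lambda \leq \lambda_0$ is attained at $\lambda_0$.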


4.4 TESTING TWO-SIDED HYPOTHESES IN ONE-PARAMETER 
EXPONENTIAL FAMILIES 


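Consider again the one-parameter exponential type family with p.d.f.s

\[ f(x; \theta) = h(x)\exp\{\theta U(x) - K(\theta)\}, \quad -\infty < \theta < \infty. \]

A two-sided simple hypothesis is $H_0: \theta = \theta_0$, $-\infty < \theta_0 < \infty$. We consider $H_0$
against a composite alternative $H_1: \theta \neq \theta_0$.
If $X = (X_1, \ldots, X_n)'$ is a vector of independent and identically distributed (i.i.d.)
random variables, then the test is based on the minimal sufficient statistic (m.s.s.)
$T(X) = \sum_{i=1}^{n} U(X_i)$. The distribution of $T(X)$, for any $\theta$, is also a one-parameter
exponential type. Hence, without loss of generality, we present the theory of this
section under the simplified notation $T(X) = X$. We are seeking a test function
$\phi^0(X)$ whose power function attains its minimum at $\theta = \theta_0$
and for which $E_{\theta_0}\{\phi^0(X)\} = \alpha$, for some preassigned level of significance $\alpha$, $0 < \alpha < 1$. We
consider the class of two-sided test functions

\[
\phi^0(x) =
\begin{cases}
1, & \text{if } x > c_\alpha^{(2)} \\
\gamma_2, & \text{if } x = c_\alpha^{(2)} \\
0, & \text{if } c_\alpha^{(1)} < x < c_\alpha^{(2)} \\
\gamma_1, & \text{if } x = c_\alpha^{(1)} \\
1, & \text{if } x < c_\alpha^{(1)},
\end{cases}
\tag{4.4.1}
\]

where $c_\alpha^{(1)} < c_\alpha^{(2)}$. Moreover, we determine the values of $c_\alpha^{(1)}$, $\gamma_1$, $c_\alpha^{(2)}$, $\gamma_2$ by considering
the requirement

\[
\begin{aligned}
&\text{(i)}\quad E_{\theta_0}\{\phi^0(X)\} = \alpha, \\
&\text{(ii)}\quad \frac{\partial}{\partial\theta}E_\theta\{\phi^0(X)\}\Big|_{\theta = \theta_0} = 0.
\end{aligned}
\tag{4.4.2}
\]

Assume that $\gamma_1 = \gamma_2 = 0$. Then

\[ \frac{\partial}{\partial\theta}E_\theta\{\phi^0(X)\} = -K'(\theta)E_\theta\{\phi^0(X)\} + E_\theta\{X\phi^0(X)\}. \tag{4.4.3} \]

Moreover,

\[ K'(\theta) = E_\theta\{X\}. \tag{4.4.4} \]

Thus,

\[ \frac{\partial}{\partial\theta}E_\theta\{\phi^0(X)\}\Big|_{\theta = \theta_0} = -\alpha E_{\theta_0}\{X\} + E_{\theta_0}\{X\phi^0(X)\}. \tag{4.4.5} \]
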
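It follows that condition (ii) of (4.4.2) is equivalent to

\[ E_{\theta_0}\{X\phi^0(X)\} = \alpha E_{\theta_0}\{X\}. \tag{4.4.6} \]

Differentiating (4.4.3) once more and using (4.4.2) (i) and (4.4.6), one can also check that

\[
\frac{\partial^2}{\partial\theta^2}E_\theta\{\phi^0(X)\}\Big|_{\theta = \theta_0}
= E_{\theta_0}\{(X - E_{\theta_0}\{X\})^2\phi^0(X)\} - \alpha V_{\theta_0}\{X\}.
\tag{4.4.7}
\]

For a two-sided test function of the form (4.4.1), which rejects only in the two tails, this is a positive quantity. Hence, the power function assumes its minimum value at
$\theta = \theta_0$, provided $\phi^0(X)$ is determined so that (4.4.2) (i) and (4.4.6) are satisfied. As
will be discussed in the next section, the two-sided test functions developed in this
section are called unbiased.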
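To make conditions (4.4.2) (i) and (4.4.6) concrete, the following sketch solves them numerically in one continuous case, so that $\gamma_1 = \gamma_2 = 0$. The setting is an assumption made for illustration only: a single observation $X$ from the exponential distribution with rate $\lambda$ (natural parameter $\theta = -\lambda$) and $H_0: \lambda = 1$. Under $H_0$ the two conditions read $1 - e^{-c_1} + e^{-c_2} = \alpha$ and $1 - (1 + c_1)e^{-c_1} + (1 + c_2)e^{-c_2} = \alpha$, and subtracting them gives $c_1e^{-c_1} = c_2e^{-c_2}$ with $c_1 < 1 < c_2$.

```python
from math import exp


def upper_root(c1):
    """Return c2 > 1 solving c2*exp(-c2) = c1*exp(-c1), for 0 < c1 < 1 (bisection)."""
    target = c1 * exp(-c1)
    lo, hi = 1.0, 2.0
    while hi * exp(-hi) > target:      # x*exp(-x) is decreasing for x > 1
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * exp(-mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def unbiased_cutoffs(alpha):
    """Return (c1, c2) satisfying (4.4.2)(i) and (4.4.6) for the rate-1 exponential."""
    lo, hi = 1e-12, 1.0
    for _ in range(200):
        c1 = 0.5 * (lo + hi)
        size = 1.0 - exp(-c1) + exp(-upper_root(c1))   # increasing in c1
        if size < alpha:
            lo = c1
        else:
            hi = c1
    c1 = 0.5 * (lo + hi)
    return c1, upper_root(c1)


def power(lam, c1, c2):
    """P{X < c1} + P{X > c2} when X is exponential with rate lam."""
    return 1.0 - exp(-lam * c1) + exp(-lam * c2)


if __name__ == "__main__":
    c1, c2 = unbiased_cutoffs(0.05)
    print("c1 =", c1, "c2 =", c2)
    for lam in (0.5, 0.9, 1.0, 1.1, 2.0):
        print(lam, power(lam, c1, c2))
```

The resulting cut-offs are not the equal-tail ones; they are chosen so that the power function has zero derivative at $\lambda = 1$, and the printed values show the power rising on both sides of the null value.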

When the family $\mathcal{F}$ is not of the one-parameter exponential type, UMP unbiased
tests may not exist. For examples of such cases, see Jogdeo and Bohrer (1973).


4.5 TESTING COMPOSITE HYPOTHESES WITH NUISANCE 
PARAMETERS—UNBIASED TESTS 


In the previous section, we discussed the theory of testing composite hypotheses when 
the distributions in the family under consideration depend on one real parameter. In 
this section, we develop the theory of most powerful unbiased tests of composite 
hypotheses. The distributions under consideration depend on several real parameters 
and the hypotheses state certain conditions on some of the parameters. The theory 
that is developed in this section is applicable only if the families of distributions under 
consideration have certain structural properties that are connected with sufficiency. 
The multiparameter exponential type families possess this property and, therefore, 
the theory is quite useful. The theory was first developed by Neyman
and Pearson (1933, 1936a, 1936b). See also Lehmann and Scheffé (1950, 1955) and
Sverdrup (1953).


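Definition 4.5.1. Consider a family of distributions, $\mathcal{F} = \{F(x; \theta); \theta \in \Theta\}$, where
$\theta$ is either real or vector valued. Suppose that the null hypothesis is $H_0: \theta \in \Theta_0$ and
the alternative hypothesis is $H_1: \theta \in \Theta_1$. A test function $\phi(X)$ is called unbiased of
size $\alpha$ if

\[ \sup_{\theta \in \Theta_0} E_\theta\{\phi(X)\} = \alpha \]

and

\[ E_\theta\{\phi(X)\} \geq \alpha, \quad \text{for all } \theta \in \Theta_1. \tag{4.5.1} \]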


In other words, a test function of size $\alpha$ is unbiased if the power of the test is not
smaller than $\alpha$ whenever the parent distribution belongs to the family corresponding
to the alternative hypothesis. Obviously, the trivial test $\phi(X) = \alpha$ with probability
one is unbiased, since $E_\theta\{\phi(X)\} = \alpha$ for all $\theta \in \Theta_1$. Thus, unbiasedness in itself
is insufficient. However, under certain conditions we can determine uniformly most
powerful tests among the unbiased ones. Let $\Theta^*$ be the common boundary of the
parametric sets $\Theta_0$ and $\Theta_1$, corresponding to $H_0$ and $H_1$ respectively. More formally,
if $\bar\Theta_0$ is the closure of $\Theta_0$ (the union of the set with its limit points) and $\bar\Theta_1$ is the
closure of $\Theta_1$, then $\Theta^* = \bar\Theta_0 \cap \bar\Theta_1$. For example, if $\theta = (\theta_1, \theta_2)$, $\Theta_0 = \{\theta; \theta_1 \leq 0\}$
and $\Theta_1 = \{\theta; \theta_1 > 0\}$, then $\Theta^* = \{\theta; \theta_1 = 0\}$. This is the $\theta_2$-axis. In testing two-sided
hypotheses, $H_0: \theta_1^{(1)} \leq \theta_1 \leq \theta_1^{(2)}$ ($\theta_2$ arbitrary) against $H_1: \theta_1 < \theta_1^{(1)}$ or $\theta_1 > \theta_1^{(2)}$ ($\theta_2$
arbitrary), the boundary consists of the two parallel lines $\Theta^* = \{\theta: \theta_1 = \theta_1^{(1)}$ or
$\theta_1 = \theta_1^{(2)}\}$.


Definition 4.5.2. For testing $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$, a test $\phi(x)$ is called
$\alpha$-similar if $E_\theta\{\phi(X)\} = \alpha$ for all $\theta \in \Theta_0$. It is called $\alpha$-similar on the boundary¹ if
$E_\theta\{\phi(X)\} = \alpha$ for all $\theta \in \Theta^*$.

¹We also call such a test a boundary $\alpha$-similar test.

Let $\mathcal{F}^*$ denote the subfamily of $\mathcal{F}$, which consists of all the distributions $F(x; \theta)$
where $\theta$ belongs to the boundary $\Theta^*$, between $\Theta_0$ and $\Theta_1$. Suppose that $\mathcal{F}^*$ is such
that a nontrivial sufficient statistic $T(X)$ with respect to $\mathcal{F}^*$ exists. In this case,
$E\{\phi(X) \mid T(X)\}$ is independent of those $\theta$ that belong to the boundary $\Theta^*$. That is,
this conditional expectation may depend on the boundary, but does not change its
value when $\theta$ changes over $\Theta^*$. If a test $\phi(X)$ has the property that

\[ E\{\phi(X) \mid T(X)\} = \alpha \quad \text{with probability 1, for all } \theta \in \Theta^*, \tag{4.5.2} \]

then $\phi(X)$ is a boundary $\alpha$-similar test. If a test $\phi(X)$ satisfies (4.5.2), we say that
it has the Neyman structure. If the power function of an unbiased test function
$\phi(X)$ of size $\alpha$ is a continuous function of $\theta$ ($\theta$ may be vector valued), then $\phi(X)$
is a boundary $\alpha$-similar test function. Furthermore, if the family of distributions of
$T(X)$ on the boundary is boundedly complete, then every boundary $\alpha$-similar test
function has the Neyman structure. Indeed, since $\mathcal{F}^*_T$ is boundedly complete and
since every test function is bounded, $E_\theta\{\phi(X)\} = \alpha$ for all $\theta \in \Theta^*$ implies that
$E\{\phi(X) \mid T(X)\} = \alpha$ with probability 1 for all $\theta$ in $\Theta^*$. It follows that if the power
function of every unbiased test is continuous in $\theta$, then the class of all test functions
having the Neyman structure with some $\alpha$, $0 < \alpha < 1$, contains all the unbiased tests
of size $\alpha$. Thus, if we can find a UMP test among those having the Neyman structure
and if the test is unbiased, then it is UMP unbiased. This result can be applied
immediately in cases of the k-parameter exponential type families. Express the joint
p.d.f. of $X$ in the form

\[ f(x; \theta, \nu) = h(x)\exp\left\{\theta U(x) + \sum_{i=1}^{k}\nu_i T_i(x) - K(\theta, \nu)\right\}, \tag{4.5.3} \]

where $\nu = (\nu_1, \ldots, \nu_k)'$ is a vector of nuisance parameters and $\theta$ is real valued. We
consider the following composite hypotheses.


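(i) One-sided hypotheses

\[ H_0: \theta \leq \theta_0, \quad \nu \text{ arbitrary}, \]

against

\[ H_1: \theta > \theta_0, \quad \nu \text{ arbitrary}. \]

(ii) Two-sided hypotheses

\[ H_0: \theta_1 \leq \theta \leq \theta_2, \quad \nu \text{ arbitrary}, \]

against

\[ H_1: \theta < \theta_1 \text{ or } \theta > \theta_2, \quad \nu \text{ arbitrary}. \]

For the one-sided hypotheses, the boundary is

\[ \Theta^* = \{(\theta, \nu);\ \theta = \theta_0,\ \nu \text{ arbitrary}\}. \]

For the two-sided hypotheses, the boundary is

\[ \Theta^* = \{(\theta, \nu);\ \theta = \theta_1 \text{ or } \theta = \theta_2,\ \nu \text{ arbitrary}\}. \]

In both cases, the sufficient statistic w.r.t. $\mathcal{F}^*$ is

\[ T(X) = (T_1(X), \ldots, T_k(X))'. \]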


We can restrict attention to test functions $\phi(U, T)$ since $(U, T)$ is a sufficient statistic
for $\mathcal{F}$. The marginal p.d.f. of $T$ is of the exponential type and is given by

\[
g(t; \theta, \nu) = \left[\int k(u, t)\exp\{\theta u\}\, d\lambda(u)\right]\cdot\exp\left\{\sum_{i=1}^{k}\nu_i t_i - K(\theta, \nu)\right\},
\tag{4.5.4}
\]

where $k(u, t) = \int I\{x: U(x) = u, T(x) = t\}h(x)\, d\mu(x)$. Hence, the conditional p.d.f.
of $U$ given $T$ is a one-parameter exponential type of the form

\[ h(u \mid t, \theta) = k(u, t)\exp\{\theta u\}\Big/\int k(u, t)\exp\{\theta u\}\, d\lambda(u). \tag{4.5.5} \]


According to the results of the previous section, we construct uniformly most
powerful test functions based on the family of conditional distributions, with p.d.f.s
(4.5.5). Accordingly, if the hypotheses are one-sided, we construct the conditional
test function

\[
\phi^0(u \mid t) =
\begin{cases}
1, & \text{if } u > \xi_\alpha(t) \\
\gamma_\alpha(t), & \text{if } u = \xi_\alpha(t) \\
0, & \text{otherwise},
\end{cases}
\tag{4.5.6}
\]

where $\xi_\alpha(t)$ and $\gamma_\alpha(t)$ are determined so that

\[ E_{\theta_0}\{\phi^0(U \mid t) \mid T(X) = t\} = \alpha \tag{4.5.7} \]

for all $t$. We notice that since $T(X)$ is sufficient for $\mathcal{F}^*$, $\gamma_\alpha(t)$ and $\xi_\alpha(t)$ can be
determined independently of $\nu$. Thus, the test function $\phi^0(U \mid T)$ has the Neyman
structure. It is a uniformly most powerful test among all tests having the Neyman
structure.
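A standard instance of the conditional test (4.5.6) is the comparison of two Poisson means; it is used here only as an illustration and is not taken from the text. Let $X \sim P(\lambda_1)$ and $Y \sim P(\lambda_2)$ be independent, with $\theta = \log(\lambda_2/\lambda_1)$, $\nu = \log\lambda_1$, $U = Y$, and $T = X + Y$. Given $T = t$, $U$ is binomial $B(t, p)$ with $p = e^\theta/(1 + e^\theta)$, so at $\theta_0 = 0$ the conditional distribution is $B(t, 1/2)$ whatever the value of $\nu$. The sketch computes $\xi_\alpha(t)$ and $\gamma_\alpha(t)$ satisfying (4.5.7); the function names are hypothetical.

```python
from math import comb


def conditional_constants(t, alpha, p=0.5):
    """Return (xi, gamma) with P{U > xi | T=t} + gamma*P{U = xi | T=t} = alpha, U|T=t ~ B(t, p)."""
    cdf = 0.0
    for u in range(t + 1):
        pmf = comb(t, u) * p ** u * (1 - p) ** (t - u)
        cdf += pmf
        if cdf >= 1 - alpha:          # xi is the (1 - alpha)-quantile of B(t, p)
            return u, (cdf - (1 - alpha)) / pmf
    raise ValueError("alpha must lie in (0, 1)")


def conditional_size(t, xi, gamma, p=0.5):
    """Conditional rejection probability of (4.5.6) given T = t."""
    pmf = [comb(t, u) * p ** u * (1 - p) ** (t - u) for u in range(t + 1)]
    return sum(pmf[xi + 1:]) + gamma * pmf[xi]


if __name__ == "__main__":
    for t in (0, 5, 12, 30):
        xi, gamma = conditional_constants(t, 0.05)
        print(t, xi, round(gamma, 4), conditional_size(t, xi, gamma))
```

The constants change with $t$ but never involve $\nu$, and the conditional rejection probability is exactly $\alpha$ for every $t$, which is the Neyman structure described above.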

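In the two-sided case, we construct the conditional test function

\[
\phi^0(U \mid T) =
\begin{cases}
1, & \text{if } U < \xi_1(T) \text{ or } U > \xi_2(T) \\
\gamma_i(T), & \text{if } U = \xi_i(T),\ i = 1, 2 \\
0, & \text{otherwise},
\end{cases}
\tag{4.5.8}
\]

where $\xi_1(T)$, $\xi_2(T)$, $\gamma_1(T)$, and $\gamma_2(T)$ are determined so that

\[ E_{\theta_i}\{\phi^0(U \mid T) \mid T(X)\} = \alpha, \quad i = 1, 2, \]

with probability one. As shown in the previous section, if in the two-sided case
$\theta_1 = \theta_2 = \theta_0$, then we determine $\gamma_i(T)$ and $\xi_i(T)$ ($i = 1, 2$) so that

\[
\begin{aligned}
&\text{(i)}\quad E_{\theta_0}\{\phi^0(U \mid T) \mid T\} = \alpha \quad \text{w.p.1}, \\
&\text{(ii)}\quad E_{\theta_0}\{U\phi^0(U \mid T) \mid T\} = \alpha E_{\theta_0}\{U \mid T\} \quad \text{w.p.1},
\end{aligned}
\tag{4.5.9}
\]

where w.p.1 means "with probability one." The test functions $\phi^0(U \mid T)$ are uniformly
most powerful unbiased ones.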

The theory of optimal unbiased test functions is strongly reinforced with the
following results. Consider first the one-sided hypotheses $H_0: \theta \leq \theta_0$, $\nu$ arbitrary;
against $H_1: \theta > \theta_0$, $\nu$ arbitrary. We show that if there exists a function $W(U, T)$ that
is increasing in $U$ for each $T$ ($U$ is real valued) and such that $W(U, T)$ and $T$ are
independent under $H_0$, then the test function

\[
\phi^0(W) =
\begin{cases}
1, & \text{if } W > C_\alpha \\
\gamma_\alpha, & \text{if } W = C_\alpha \\
0, & \text{otherwise}
\end{cases}
\tag{4.5.10}
\]

is uniformly most powerful unbiased, where $C_\alpha$ and $\gamma_\alpha$ are determined so that the
size of $\phi^0(W)$ is $\alpha$. Indeed, the power of $\phi^0(W)$ at $(\theta_0, \nu)$ is $\alpha$ by construction. Thus,

\[ P_{\theta_0, \nu}\{W(U, T) > C_\alpha\} + \gamma_\alpha P_{\theta_0, \nu}\{W(U, T) = C_\alpha\} = \alpha. \tag{4.5.11} \]

Since $W(U, T)$ is independent of $T$ at $(\theta_0, \nu)$, $C_\alpha$ and $\gamma_\alpha$ are independent of $T$.
Furthermore, since $W(U, T)$ is an increasing function of $U$ for each $T$, the test
function $\phi^0$ is equivalent to the conditional test function (4.5.6). Similarly, for testing
the two-sided hypotheses $H_0: \theta_1 \leq \theta \leq \theta_2$, $\nu$ arbitrary, we can employ the equivalent
test function

\[
\phi^0(W) =
\begin{cases}
1, & \text{if } W < C_1 \text{ or } W > C_2 \\
\gamma_i, & \text{if } W = C_i,\ i = 1, 2 \\
0, & \text{otherwise}.
\end{cases}
\tag{4.5.12}
\]

Here, we require that $W(U, T)$ is independent of $T$ at all the points $(\theta_1, \nu)$ and $(\theta_2, \nu)$.
When $\theta_1 = \theta_2 = \theta_0$, we require that $W(U, T) = a(T)U + b(T)$, where $a(T) > 0$
with probability one. This linear function of $U$ for each $T$ implies that condition
(4.5.9) and the condition

\[
\begin{aligned}
E_{\theta_0}\{\phi^0(W) \mid T\} &= \alpha, \\
E_{\theta_0}\{W\phi^0(W) \mid T\} &= \alpha E_{\theta_0}\{W \mid T\},
\end{aligned}
\tag{4.5.13}
\]

are equivalent.


4.6 LIKELIHOOD RATIO TESTS 


As defined in Section 3.3, the likelihood function $L(\theta \mid x)$ is a nonnegative function
on the parameter space $\Theta$, proportional to the joint p.d.f. $f(x; \theta)$. We discuss here
tests of composite hypotheses analogous to the Neyman-Pearson likelihood ratio
tests. If $H_0$ is a specified null hypothesis, corresponding to the parametric set $\Theta_0$, and
if $\Theta$ is the whole parameter space, we define the likelihood ratio statistic as

\[ \Lambda(X_n) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid X_n)}{\sup_{\theta \in \Theta} L(\theta \mid X_n)}. \tag{4.6.1} \]

Obviously, $0 \leq \Lambda(x_n) \leq 1$. A likelihood ratio test is defined as

\[
\phi(X_n) =
\begin{cases}
1, & \text{if } \Lambda(X_n) \leq C_\alpha \\
0, & \text{otherwise},
\end{cases}
\tag{4.6.2}
\]

where $C_\alpha$ is determined so that

\[ \sup_{\theta \in \Theta_0} P_\theta\{\Lambda(X_n) \leq C_\alpha\} \leq \alpha. \tag{4.6.3} \]

Due to the nature of the statistic $\Lambda(X_n)$, its distribution may be discontinuous at
$\Lambda = 1$ even if the distribution of $X_n$ is continuous. For this reason, the test may not
exist for every $\alpha$.

Generally, even if a generalized likelihood ratio test of size $\alpha$ exists, it is difficult
to determine the critical level $C_\alpha$. In Example 4.14 we demonstrate such a case.
Generally, for parametric models, the sampling distribution of $\Lambda(X)$, under $H_0$, can be
approximated by simulation. In addition, under certain regularity conditions, if $H_0$ is
a simple hypothesis and $\theta$ is a k-dimensional vector, then the asymptotic distribution
of $-2\log\Lambda(X_n)$ as $n \to \infty$ is like that of $\chi^2[m]$, where $m = \dim(\Theta) - \dim(\Theta_0)$
(Wilks, 1962, Chapter 13, Section 13.4). Thus, if the sample is not too small, the
$(1 - \alpha)$-quantile of $\chi^2[m]$ can provide a good approximation to $-2\log C_\alpha$. In cases
of a composite null hypothesis we have a similar result. However, the asymptotic
distribution may not be unique.
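The simulation approach and the chi-squared approximation mentioned above can be compared in a small experiment. The sketch below assumes, for illustration only, an i.i.d. sample from the exponential distribution with rate $\lambda$ and the simple hypothesis $H_0: \lambda = 1$, so that $m = 1$ and $-2\log\Lambda(X_n) = 2n(\bar{x} - 1 - \log\bar{x})$. The sample size, number of replications, and seed are arbitrary, and the 0.95-quantile of $\chi^2[1]$ is entered as a tabulated constant.

```python
import math
import random

CHI2_1_Q95 = 3.841458820694124   # 0.95-quantile of chi-square[1] (tabulated value)


def neg2_log_lambda(sample):
    """-2 log Lambda for H0: rate = 1 in the exponential model."""
    n = len(sample)
    xbar = sum(sample) / n
    return 2.0 * n * (xbar - 1.0 - math.log(xbar))


def rejection_rate(n=50, reps=10_000, seed=7):
    """Simulated probability, under H0, that -2 log Lambda exceeds the chi-square quantile."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        if neg2_log_lambda(sample) > CHI2_1_Q95:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print(rejection_rate())   # should be close to the nominal level 0.05
```

If the approximation is adequate for the chosen $n$, the simulated rejection rate is close to the nominal $\alpha = 0.05$; repeating the experiment with a very small $n$ shows how the approximation can deteriorate.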


4.6.1 Testing in Normal Regression Theory 


A normal regression model is one in which n random variables Y;,..., Y, are 

observed at n different experimental setups (treatment combinations). The vector 

Y, = (Y,..., Y,)' is assumed to have a multinormal distribution N(XB, o71), where 

X is ann x p matrix of constants with rank = p and B’ = (f),..., Bp) is a vector 

of unknown parameters, | < p <n. The parameter space is © = {(f),..., By, o); 

—oo < B; < ow foralli = 1,..., pand0 <o < oo}. Consider the null hypothesis 
Ao: Bri =-+> = Bp =9, fBi,.-.,B8-, © arbitrary, 


where | <r < p. Thus, Oo = {(61,..., 6, 0,...,0,0); —oo < B; < oo foralli = 
1,...,7r; 0 <o < ov}. This is the null hypothesis that tests the significance of the 
(p —r) B-values B; (j =r +1,..., p). The likelihood function is 


1 1 
L(p,o | Y,X)= a exp 7526 XBY(Y - xp)| 3 


We determine now the values of 6 and o for which the likelihood function is maxi- 
mized, for the given X and Y. Starting with 8, we see that the likelihood function is 
maximized when Q(8) = (Y — Xf)'(Y — X8) is minimized irrespective of 0. The 
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vector 6 that minimizes Q({) is called the least-squares estimator of 6. Differenti- 
ation of Q(8) with respect to the vector B yields 


V Q(B) = —2X'(¥ — XB). (4.6.4) 
Equating this gradient vector to zero yields the vector 
B= (X'X) XY. (4.6.5) 


We recall that $X'X$ is nonsingular since $X$ is assumed to be of full rank $p$. Substituting
$Q(\hat\beta)$ in the likelihood function, we obtain

$$L(\hat\beta, \sigma) = \frac{1}{\sigma^n} \exp\left\{-\frac{1}{2\sigma^2} Q(\hat\beta)\right\}, \tag{4.6.6}$$

where

$$Q(\hat\beta) = Y'(I - X(X'X)^{-1}X')Y, \tag{4.6.7}$$

and $A = I - X(X'X)^{-1}X'$ is a symmetric idempotent matrix. Differentiating $L(\hat\beta, \sigma)$
with respect to $\sigma$ and equating to zero, we obtain that the value $\hat\sigma^2$ that maximizes
the likelihood function is

$$\hat\sigma^2 = \frac{1}{n} Y'AY = Q(\hat\beta)/n. \tag{4.6.8}$$

Thus, the denominator of (4.6.1) is

$$\sup_{\sigma} \sup_{\beta} L(\beta, \sigma \mid Y, X) = \frac{1}{\hat\sigma^n} \exp\left\{-\frac{n}{2}\right\}. \tag{4.6.9}$$


We determine now the numerator of (4.6.1). Let $K = (0 : I_{p-r})$ be a $(p - r) \times p$
matrix, which is partitioned to a zero matrix of order $(p - r) \times r$ and the identity
matrix of order $(p - r) \times (p - r)$. $K$ is of full rank, and $KK' = I_{p-r}$. The null
hypothesis $H_0$ imposes on the linear model the constraint that $K\beta = 0$. Let $\beta^*$
and $\tilde\sigma^2$ denote the values of $\beta$ and $\sigma^2$ that maximize the likelihood function
under the constraint $K\beta = 0$. To determine the value of $\beta^*$, we differentiate first the
Lagrangian

$$D(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + \beta'K'\lambda, \tag{4.6.10}$$

where $\lambda$ is a $(p - r) \times 1$ vector of constants. Differentiating with respect to $\beta$, we
obtain the simultaneous equations

$$\begin{aligned} &\text{(i)} \quad -2X'(Y - X\beta) + K'\lambda = 0, \\ &\text{(ii)} \quad K\beta = 0. \end{aligned} \tag{4.6.11}$$


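From (i), we obtain that the constrained least-squares estimator $\beta^*$ is given by

$$\beta^* = (X'X)^{-1}\left[X'Y - \tfrac{1}{2}K'\lambda\right] = \hat\beta - \tfrac{1}{2}(X'X)^{-1}K'\lambda. \tag{4.6.12}$$

Substituting $\beta^*$ in (4.6.11) (ii), we obtain

$$K\hat\beta = \tfrac{1}{2} K(X'X)^{-1}K'\lambda. \tag{4.6.13}$$

Since $K$ is of full rank $p - r$, $K(X'X)^{-1}K'$ is nonsingular. Hence,

$$\lambda = 2[K(X'X)^{-1}K']^{-1}K\hat\beta$$

and the constrained least-squares estimator is

$$\beta^* = [I - (X'X)^{-1}K'[K(X'X)^{-1}K']^{-1}K]\hat\beta. \tag{4.6.14}$$

To obtain $\tilde\sigma^2$, we employ the derivation presented before and find that

$$\begin{aligned} \tilde\sigma^2 &= Q(\beta^*)/n \\ &= \frac{1}{n}[Y - X\hat\beta + X(X'X)^{-1}K'B^{-1}K\hat\beta]'[Y - X\hat\beta + X(X'X)^{-1}K'B^{-1}K\hat\beta], \end{aligned} \tag{4.6.15}$$

where $B = K(X'X)^{-1}K'$. Simple algebraic manipulations (the cross-product terms vanish since $X'(Y - X\hat\beta) = 0$) yield that

$$\tilde\sigma^2 = \hat\sigma^2 + \frac{1}{n}\hat\beta'K'B^{-1}K\hat\beta. \tag{4.6.16}$$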


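Hence, the numerator of (4.6.1) is

$$L(\beta^*, \tilde\sigma \mid Y, X) = \frac{1}{\tilde\sigma^n}\exp\left\{-\frac{n}{2}\right\} = \frac{\exp\{-n/2\}}{\left[\hat\sigma^2 + \frac{1}{n}\hat\beta'K'B^{-1}K\hat\beta\right]^{n/2}}. \tag{4.6.17}$$

The likelihood ratio is then

$$\Lambda(Y_n) = \left(1 + \frac{1}{n\hat\sigma^2}\hat\beta'K'B^{-1}K\hat\beta\right)^{-n/2}. \tag{4.6.18}$$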
This likelihood ratio is smaller than a constant $C_\alpha$ if

$$F = \frac{n - p}{p - r}\,\hat\beta'K'B^{-1}K\hat\beta / Q(\hat\beta) \tag{4.6.19}$$

is greater than an appropriate constant $k_\alpha$. In this case, we can easily find the exact
distribution of the F-ratio (4.6.19). Indeed, according to the results of Section 2.10,
$Q(\hat\beta) = Y'AY \sim \sigma^2\chi^2[n - p]$ since $A = I - X(X'X)^{-1}X'$ is an idempotent matrix
of rank $n - p$ and since the parameter of noncentrality is

$$\lambda = \frac{1}{2\sigma^2}\beta'X'(I - X(X'X)^{-1}X')X\beta = 0.$$

Furthermore,

$$\hat\beta'K'B^{-1}K\hat\beta = Y'X(X'X)^{-1}K'B^{-1}K(X'X)^{-1}X'Y. \tag{4.6.20}$$


Let $C = X(X'X)^{-1}K'B^{-1}K(X'X)^{-1}X'$. It is easy to verify that $C$ is an idempotent
matrix of rank $p - r$. Hence,

$$\hat\beta'K'B^{-1}K\hat\beta \sim \sigma^2\chi^2[p - r; \lambda^*], \tag{4.6.21}$$

where

$$\lambda^* = \frac{1}{2\sigma^2}\beta'K'B^{-1}K\beta. \tag{4.6.22}$$

We notice that $K\beta = (\beta_{r+1}, \ldots, \beta_p)'$, which is equal to zero if the null hypothesis is
true. Thus, under $H_0$, $\lambda^* = 0$ and otherwise, $\lambda^* > 0$. Finally,

$$CA = C - X(X'X)^{-1}K'B^{-1}K(X'X)^{-1}X'X(X'X)^{-1}X' = 0. \tag{4.6.23}$$

Hence, the two quadratic forms $Y'AY$ and $Y'CY$ are independent. It follows that under
$H_0$, the F-ratio (4.6.19) is distributed like a central $F[p - r, n - p]$ statistic, and the
critical level $k_\alpha$ is the $(1 - \alpha)$-quantile $F_{1-\alpha}[p - r, n - p]$. The power function of
the test is

$$\psi(\lambda^*) = P\{F[p - r, n - p; \lambda^*] \ge F_{1-\alpha}[p - r, n - p]\}. \tag{4.6.24}$$

A special case of testing in normal regression theory is the analysis of variance
(ANOVA). We present this analysis in the following section.
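
The F-ratio (4.6.19) can be illustrated numerically in the simplest case. The following is a minimal sketch, assuming a straight-line model with $p = 2$ (intercept and slope) and $r = 1$, so that $H_0$ states that the slope is zero. The function name `simple_regression_f` and the data are hypothetical and serve only as an illustration. The sketch uses the fact, implied by (4.6.16), that $\hat\beta'K'B^{-1}K\hat\beta = Q(\beta^*) - Q(\hat\beta)$.

```python
def simple_regression_f(x, y):
    """F-ratio (4.6.19) for H0: slope = 0 in the model y = b1 + b2 * x + error."""
    n = len(x)
    p, r = 2, 1  # p coefficients in the model, r of them unrestricted under H0
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx                   # least-squares estimator, cf. (4.6.5)
    intercept = y_bar - slope * x_bar
    # Unconstrained residual sum of squares Q(beta_hat) = n * sigma_hat^2.
    q_hat = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Constrained fit (slope = 0): the least-squares intercept is y_bar.
    q_star = sum((yi - y_bar) ** 2 for yi in y)
    # By (4.6.16), q_star - q_hat equals beta_hat' K' B^{-1} K beta_hat.
    f_ratio = (n - p) / (p - r) * (q_star - q_hat) / q_hat
    return slope, intercept, q_hat, q_star, f_ratio


slope, intercept, q_hat, q_star, f_ratio = simple_regression_f(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 2.0, 4.0]
)
print(slope, intercept, q_hat, q_star, f_ratio)
```

The computed ratio would then be compared with the quantile $F_{1-\alpha}[p - r, n - p]$, which has to be taken from tables or from a statistical library.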


4.6.2 Comparison of Normal Means: The Analysis of Variance 


Consider an experiment in which $r$ independent samples from normal distributions
are observed. The basic assumption is that all the $r$ variances are equal, i.e.,
$\sigma_1^2 = \cdots = \sigma_r^2 = \sigma^2$ $(r \ge 2)$. We test the hypothesis $H_0: \mu_1 = \cdots = \mu_r$, $\sigma^2$ arbitrary.
The sample m.s.s. is $(\bar X_1, \ldots, \bar X_r, S_w^2)$, where $\bar X_i$ is the mean of the $i$th sample
and $S_w^2$ is the pooled "within" variance defined in the following manner. Let $n_i$ be
the size of the $i$th sample, $\nu_i = n_i - 1$; $S_i^2$ the variance of the $i$th sample; and let
$\nu = \sum_{i=1}^{r} \nu_i$. Then

$$S_w^2 = \frac{1}{\nu}\sum_{i=1}^{r} \nu_i S_i^2. \tag{4.6.25}$$

Since the sample means are independent of the sample variances in normal distributions,
$S_w^2$ is independent of $\bar X_1, \ldots, \bar X_r$. The variance "between" samples is

$$S_b^2 = \frac{1}{r - 1}\sum_{i=1}^{r} n_i(\bar X_i - \bar X)^2, \tag{4.6.26}$$

where $\bar X = \sum_{i=1}^{r} n_i \bar X_i \big/ \sum_{i=1}^{r} n_i$ is the grand mean. Obviously $S_w^2$ and $S_b^2$ are
independent. Moreover, under $H_0$, $S_w^2 \sim \frac{\sigma^2}{\nu}\chi^2[\nu]$ and $S_b^2 \sim \frac{\sigma^2}{r-1}\chi^2[r - 1]$. Hence, the
variance ratio

$$F = S_b^2/S_w^2 \tag{4.6.27}$$

is distributed, under $H_0$, like a central $F[r - 1, \nu]$ statistic. The hypothesis $H_0$ is
rejected if $F \ge F_{1-\alpha}[r - 1, \nu]$. If the null hypothesis $H_0$ is not true, the distribution
of $S_b^2$ is like that of $\frac{\sigma^2}{r-1}\chi^2[r - 1; \lambda]$, where the noncentrality parameter is
given by

$$\lambda = \frac{1}{2\sigma^2}\sum_{i=1}^{r} n_i(\mu_i - \bar\mu)^2 \tag{4.6.28}$$

and $\bar\mu = \sum_{i=1}^{r} n_i\mu_i \big/ \sum_{i=1}^{r} n_i$ is a weighted average of the true means. Accordingly, the
power of the test, as a function of $\lambda$, is

$$\psi(\lambda) = P\{F[r - 1, \nu; \lambda] \ge F_{1-\alpha}[r - 1, \nu]\}. \tag{4.6.29}$$

This power function can be expressed according to (2.12.22) as

$$\psi(\lambda) = e^{-\lambda}\sum_{j=0}^{\infty}\frac{\lambda^j}{j!}\, I_{1-R(\xi)}\left(\frac{\nu}{2}, \frac{r-1}{2} + j\right), \tag{4.6.30}$$

where $\xi = F_{1-\alpha}[r - 1, \nu]$ and $R(\xi) = \frac{r-1}{\nu}\xi \Big/ \left(1 + \frac{r-1}{\nu}\xi\right)$.
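
The computation of the variance ratio (4.6.27) can be sketched as follows. This is only an illustration: the function name `one_way_f` and the data are hypothetical, and each sample is assumed to contain at least two observations.

```python
def one_way_f(samples):
    """Return (S_w^2, S_b^2, F) of (4.6.25)-(4.6.27) for a list of samples."""
    r = len(samples)
    sizes = [len(s) for s in samples]
    means = [sum(s) / len(s) for s in samples]
    # Grand mean: weighted average of the sample means.
    grand_mean = sum(n_i * m for n_i, m in zip(sizes, means)) / sum(sizes)
    nu = sum(n_i - 1 for n_i in sizes)
    # Sum over samples of nu_i * S_i^2, i.e. the within-sample sums of squares.
    within_ss = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    s2_within = within_ss / nu
    s2_between = sum(
        n_i * (m - grand_mean) ** 2 for n_i, m in zip(sizes, means)
    ) / (r - 1)
    return s2_within, s2_between, s2_between / s2_within


print(one_way_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]))
```

The resulting ratio is compared with $F_{1-\alpha}[r - 1, \nu]$, obtained from tables or from a statistical library.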


4.6.2.1 One-Way Layout Experiments 

The F-test given by (4.6.27) is a basic test statistic in the analysis of statistical
experiments. The method of analysis is known as a one-way layout analysis of
variance (ANOVA). Consider an experiment in which $N = n \cdot r$ experimental units
are randomly assigned to $r$ groups (blocks). Each group of $n$ units is then subjected to
a different treatment. More specifically, one constructs a statistical model assuming
that the observed values in the various groups are samples of independent random
variables having normal distributions. Furthermore, it is assumed that all the $r$ normal
distributions have the same variance $\sigma^2$ (unknown). The $r$ means are represented by
the linear model

$$\mu_i = \mu + \tau_i, \quad i = 1, \ldots, r, \tag{4.6.31}$$

where $\sum_{i=1}^{r}\tau_i = 0$. The parameters $\tau_1, \ldots, \tau_r$ represent the incremental effects of the
treatments. $\mu$ is the (grand) average yield associated with the experiment. Testing
whether the population means are the same is equivalent to testing whether all $\tau_i = 0$,
$i = 1, \ldots, r$. Thus, the hypotheses are

$$H_0: \sum_{i=1}^{r}\tau_i^2 = 0$$

against

$$H_1: \sum_{i=1}^{r}\tau_i^2 \ne 0.$$

We perform the F-test (4.6.27). The parameter of noncentrality (4.6.28) assumes the
value

$$\lambda = \frac{n}{2\sigma^2}\sum_{i=1}^{r}\tau_i^2. \tag{4.6.32}$$


4.6.2.2 Two-Way Layout Experiments 

If the experiment is designed to test the incremental effects of two factors (drug A and
drug B) and their interaction, and if factor A is observed at $r_1$ levels and factor B at $r_2$
levels, there should be $s = r_1 \times r_2$ groups (blocks) of size $n$. It is assumed that these $s$
samples are mutually independent, and the observations within each sample represent
i.i.d. random variables having $N(\mu_{ij}, \sigma^2)$ distributions, $i = 1, \ldots, r_1$; $j = 1, \ldots, r_2$.
The variances are all the same. The linear model is expressed in the form

$$\mu_{ij} = \mu + \tau_i^A + \tau_j^B + \tau_{ij}^{AB}, \quad i = 1, \ldots, r_1, \; j = 1, \ldots, r_2,$$

where $\sum_{i=1}^{r_1}\tau_i^A = 0$, $\sum_{j=1}^{r_2}\tau_j^B = 0$, and $\sum_{j=1}^{r_2}\tau_{ij}^{AB} = 0$ for each $i = 1, \ldots, r_1$ and $\sum_{i=1}^{r_1}\tau_{ij}^{AB} = 0$
for each $j = 1, \ldots, r_2$. The parameters $\tau_i^A$ are called the main effects of factor A;
$\tau_j^B$ are called the main effects of factor B; and $\tau_{ij}^{AB}$ are the interaction parameters.
The hypotheses that one may wish to test are whether the main effects are significant
and whether the interaction is significant. Thus, we set up the null hypotheses:

$$\begin{aligned} H_0^{(1)} &: \sum_{i}\sum_{j}(\tau_{ij}^{AB})^2 = 0, \\ H_0^{(2)} &: \sum_{i}(\tau_i^A)^2 = 0, \\ H_0^{(3)} &: \sum_{j}(\tau_j^B)^2 = 0. \end{aligned} \tag{4.6.33}$$
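These hypotheses are tested by constructing F-tests in the following manner. Let
$X_{ijk}$, $i = 1, \ldots, r_1$; $j = 1, \ldots, r_2$; and $k = 1, \ldots, n$ designate the observed random
variable (yield) of the $k$th unit at the $(i, j)$th group. Let $\bar X_{ij}$ denote the sample mean
of the $(i, j)$th group; $\bar X_{i\cdot}$, the overall mean of the groups subject to level $i$ of factor A;
$\bar X_{\cdot j}$, the overall mean of the groups subject to level $j$ of factor B; and $\bar X$, the grand
mean; i.e.,

$$\begin{aligned} \bar X_{ij} &= \frac{1}{n}\sum_{k=1}^{n} X_{ijk}, \\ \bar X_{i\cdot} &= \frac{1}{r_2}\sum_{j=1}^{r_2}\bar X_{ij}, \\ \bar X_{\cdot j} &= \frac{1}{r_1}\sum_{i=1}^{r_1}\bar X_{ij}, \end{aligned} \tag{4.6.34}$$

and

$$\bar X = \frac{1}{r_1 r_2}\sum_{i=1}^{r_1}\sum_{j=1}^{r_2}\bar X_{ij}.$$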


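The sum of squares of deviations around $\bar X$ is partitioned into four components in the
following manner.

$$\begin{aligned} \sum_{i=1}^{r_1}\sum_{j=1}^{r_2}\sum_{k=1}^{n}(X_{ijk} - \bar X)^2 = {} & \sum_{i=1}^{r_1}\sum_{j=1}^{r_2}\sum_{k=1}^{n}(X_{ijk} - \bar X_{ij})^2 \\ & + n\sum_{i=1}^{r_1}\sum_{j=1}^{r_2}(\bar X_{ij} - \bar X_{i\cdot} - \bar X_{\cdot j} + \bar X)^2 \\ & + n r_2\sum_{i=1}^{r_1}(\bar X_{i\cdot} - \bar X)^2 + n r_1\sum_{j=1}^{r_2}(\bar X_{\cdot j} - \bar X)^2. \end{aligned} \tag{4.6.35}$$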
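
The partition (4.6.35) can be checked numerically. The sketch below is an illustration under the assumption of a balanced layout with the same number $n$ of observations in every cell; the function name `two_way_partition`, the dictionary keys, and the data are hypothetical.

```python
def two_way_partition(x):
    """Sums of squares of (4.6.35); x[i][j] is the list of n observations in cell (i, j)."""
    r1, r2, n = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[i][j]) / n for j in range(r2)] for i in range(r1)]
    row = [sum(cell[i]) / r2 for i in range(r1)]                          # level i of factor A
    col = [sum(cell[i][j] for i in range(r1)) / r1 for j in range(r2)]    # level j of factor B
    grand = sum(row) / r1
    cells = [(i, j) for i in range(r1) for j in range(r2)]
    q_total = sum((v - grand) ** 2 for i, j in cells for v in x[i][j])
    q_within = sum((v - cell[i][j]) ** 2 for i, j in cells for v in x[i][j])
    q_inter = n * sum((cell[i][j] - row[i] - col[j] + grand) ** 2 for i, j in cells)
    q_a = n * r2 * sum((m - grand) ** 2 for m in row)
    q_b = n * r1 * sum((m - grand) ** 2 for m in col)
    return {"total": q_total, "within": q_within, "interaction": q_inter, "A": q_a, "B": q_b}


data = [[[1.0, 3.0], [2.0, 4.0]], [[5.0, 7.0], [10.0, 12.0]]]
q = two_way_partition(data)
print(q)
# The four components on the right-hand side add up to the total sum of squares.
print(q["within"] + q["interaction"] + q["A"] + q["B"] - q["total"])
```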


The four terms on the right-hand side of (4.6.35) are mutually independent quadratic
forms having distributions proportional to those of central or noncentral chi-squared
random variables. Let us denote by $Q_T$ the quadratic form on the left-hand side of
(4.6.35) and the terms on the right-hand side (moving from left to right) by $Q_W$,
$Q_{AB}$, $Q_A$, and $Q_B$, respectively. Then we can show that

$$Q_W \sim \sigma^2\chi^2[\nu_W], \quad \text{where } \nu_W = N - s. \tag{4.6.36}$$

Similarly,

$$Q_{AB} \sim \sigma^2\chi^2[\nu_{AB}; \lambda_{AB}], \quad \text{where } \nu_{AB} = (r_1 - 1)\times(r_2 - 1) \tag{4.6.37}$$

and the parameter of noncentrality is

$$\lambda_{AB} = \frac{n}{2\sigma^2}\sum_{i=1}^{r_1}\sum_{j=1}^{r_2}(\tau_{ij}^{AB})^2. \tag{4.6.38}$$

Let $S_W^2 = Q_W/\nu_W$ and $S_{AB}^2 = Q_{AB}/\nu_{AB}$. These are the pooled sample variance
within groups and the variance between groups due to interaction. If the null hypothesis
$H_0^{(1)}$ of zero interaction is correct, then the F-ratio,

$$F = S_{AB}^2/S_W^2, \tag{4.6.39}$$

is distributed like a central $F[\nu_{AB}, \nu_W]$. Otherwise, it has a noncentral F-distribution
as $F[\nu_{AB}, \nu_W; \lambda_{AB}]$. Notice also that

$$E\{S_W^2\} = \sigma^2 \tag{4.6.40}$$

and

$$E\{S_{AB}^2\} = \sigma^2 + n\sigma_{AB}^2, \tag{4.6.41}$$

where

$$\sigma_{AB}^2 = \frac{1}{\nu_{AB}}\sum_{i=1}^{r_1}\sum_{j=1}^{r_2}(\tau_{ij}^{AB})^2. \tag{4.6.42}$$

Formula (4.6.42) can be easily derived from (4.6.38) by employing the mixing relationship
(2.8.6), $\chi^2[\nu_{AB}; \lambda_{AB}] \sim \chi^2[\nu_{AB} + 2J]$, where $J$ is a Poisson random variable,
$P(\lambda_{AB})$. To test the hypotheses $H_0^{(2)}$ and $H_0^{(3)}$, concerning the main effects of A
and B, we construct the F-statistics

$$F_A = \frac{S_A^2}{S_W^2}, \qquad F_B = \frac{S_B^2}{S_W^2}, \tag{4.6.43}$$

where $S_A^2 = Q_A/\nu_A$, $\nu_A = r_1 - 1$ and $S_B^2 = Q_B/\nu_B$, $\nu_B = r_2 - 1$. Under the null
hypotheses these statistics have central $F[\nu_A, \nu_W]$ and $F[\nu_B, \nu_W]$ distributions.
Indeed, for each $i = 1, \ldots, r_1$, $\bar X_{i\cdot} \sim N(\mu + \tau_i^A, \sigma^2/(n r_2))$. Hence,

$$Q_A = n r_2\sum_{i=1}^{r_1}(\bar X_{i\cdot} - \bar X)^2 \sim \sigma^2\chi^2[\nu_A; \lambda_A] \tag{4.6.44}$$

with

$$\lambda_A = \frac{n r_2}{2\sigma^2}\sum_{i=1}^{r_1}(\tau_i^A)^2. \tag{4.6.45}$$


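Similarly,

$$Q_B \sim \sigma^2\chi^2[\nu_B; \lambda_B] \tag{4.6.46}$$

with

$$\lambda_B = \frac{n r_1}{2\sigma^2}\sum_{j=1}^{r_2}(\tau_j^B)^2. \tag{4.6.47}$$

Under the null hypotheses $H_0^{(2)}$ and $H_0^{(3)}$ both $\lambda_A$ and $\lambda_B$ are zero. Thus, the $(1 - \alpha)$-quantiles
of the central F-distributions mentioned above provide critical values of
the test statistics $F_A$ and $F_B$. We also remark that

$$\begin{aligned} E\{S_A^2\} &= \sigma^2 + n r_2\sigma_A^2, \\ E\{S_B^2\} &= \sigma^2 + n r_1\sigma_B^2, \end{aligned} \tag{4.6.48}$$

where

$$\begin{aligned} \sigma_A^2 &= \frac{1}{r_1 - 1}\sum_{i=1}^{r_1}(\tau_i^A)^2, \\ \sigma_B^2 &= \frac{1}{r_2 - 1}\sum_{j=1}^{r_2}(\tau_j^B)^2. \end{aligned} \tag{4.6.49}$$

These results are customarily summarized in the following table of ANOVA.

Table 4.1 A Two-Way Scheme for Analysis of Variance

Source            $\nu$                   Sum of Squares   MS           F                    E{MS}
Factor A          $r_1 - 1$               $Q_A$            $S_A^2$      $S_A^2/S_W^2$        $\sigma^2 + n r_2\sigma_A^2$
Factor B          $r_2 - 1$               $Q_B$            $S_B^2$      $S_B^2/S_W^2$        $\sigma^2 + n r_1\sigma_B^2$
Interaction       $(r_1 - 1)(r_2 - 1)$    $Q_{AB}$         $S_{AB}^2$   $S_{AB}^2/S_W^2$     $\sigma^2 + n\sigma_{AB}^2$
Between groups    $r_1 r_2 - 1$           $Q_T - Q_W$      -            -                    -
Within groups     $N - r_1 r_2$           $Q_W$            $S_W^2$      -                    $\sigma^2$
Total             $N - 1$                 $Q_T$            -            -                    -
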
Finally, we would like to remark that the three tests of significance provided by
$F_{AB}$, $F_A$, and $F_B$ are not independent, since the within variance estimator $S_W^2$ is used
by all the three test statistics. Moreover, if we wish that the level of significance of
all the three tests simultaneously will not exceed $\alpha$, we should reduce that of each
test to $\alpha/3$. In other words, suppose that $H_0^{(1)}$, $H_0^{(2)}$, and $H_0^{(3)}$ are true and we wish
not to reject any one of these. We accept simultaneously the three hypotheses in
the event of $\{F_{AB} \le F_{1-\alpha/3}[\nu_{AB}, \nu_W],\ F_A \le F_{1-\alpha/3}[\nu_A, \nu_W],\ F_B \le F_{1-\alpha/3}[\nu_B, \nu_W]\}$.
According to the Bonferroni inequality, if $E_1$, $E_2$, and $E_3$ are any three
events

$$P\{E_1 \cap E_2 \cap E_3\} = 1 - P\{\bar E_1 \cup \bar E_2 \cup \bar E_3\} \ge 1 - P\{\bar E_1\} - P\{\bar E_2\} - P\{\bar E_3\}, \tag{4.6.50}$$

where $\bar E_i$ $(i = 1, 2, 3)$ designates the complement of $E_i$. Thus, the probability that all
the three hypotheses will be simultaneously accepted, given that they are all true, is at
least $1 - \alpha$. Generally, a scientist will find the result of the analysis very frustrating if
all the null hypotheses are accepted. However, by choosing the overall $\alpha$ as sufficiently
small, the rejection of any of these hypotheses becomes very meaningful. For further
reading on testing in linear models, see Lehmann (1997, Chapter 7), Anderson (1958),
Graybill (1961, 1976), Searle (1971) and others.


4.7 THE ANALYSIS OF CONTINGENCY TABLES 


4.7.1 The Structure of Multi-Way Contingency Tables and 
the Statistical Model 


There are several qualitative variables $A_1, \ldots, A_k$. The $i$th variable assumes $m_i$ levels
(categories). A sample of $N$ statistical units is classified according to the $M = \prod_{i=1}^{k} m_i$
combinations of the levels of the $k$ variables. These level combinations will be called
cells. Let $f(i_1, \ldots, i_k)$ denote the observed frequency in the $(i_1, \ldots, i_k)$ cell. We
distinguish between contingency tables having fixed or random marginal frequencies.
In this section we discuss only structures with random margins. The statistical
model assumes that the vector of $M$ frequencies has a multinomial distribution with
parameters $N$ and $P$, where $P$ is the vector of cell probabilities $P(i_1, \ldots, i_k)$. We
discuss here some methods of testing the significance of the association (dependence)
among the categorical variables.


4.7.2 Testing the Significance of Association 


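We illustrate the test for association in a two-way table that is schematized below.

Table 4.2 A Scheme of a Two-Way Contingency Table

             $A_1$          $\cdots$   $A_{m_1}$        $\Sigma$
$B_1$        $f(1, 1)$      $\cdots$   $f(1, m_1)$      $f(1, \cdot)$
$B_2$        $f(2, 1)$      $\cdots$   $f(2, m_1)$      $f(2, \cdot)$
$\vdots$
$B_{m_2}$    $f(m_2, 1)$    $\cdots$   $f(m_2, m_1)$    $f(m_2, \cdot)$

$f(i, j)$ is the observed frequency of the $(i, j)$th cell. We further denote the observed
marginal frequencies by

$$\begin{aligned} f(i, \cdot) &= \sum_{j=1}^{m_1} f(i, j), \quad i = 1, \ldots, m_2, \\ f(\cdot, j) &= \sum_{i=1}^{m_2} f(i, j), \quad j = 1, \ldots, m_1. \end{aligned} \tag{4.7.1}$$

Let

$$\begin{aligned} P(i, \cdot) &= \sum_{j=1}^{m_1} P(i, j), \\ P(\cdot, j) &= \sum_{i=1}^{m_2} P(i, j), \end{aligned} \tag{4.7.2}$$

denote the marginal probabilities.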

The categorical variables A and B are independent if and only if $P(i, j) =
P(i, \cdot)P(\cdot, j)$ for all $(i, j)$. Thus, if A and B are independent, the expected frequency
at $(i, j)$ is

$$E(i, j) = N P(i, \cdot)P(\cdot, j). \tag{4.7.3}$$

Since $P(i, \cdot)$ and $P(\cdot, j)$ are unknown, we estimate $E(i, j)$ by

$$e(i, j) = N\,\frac{f(i, \cdot)}{N}\cdot\frac{f(\cdot, j)}{N} = f(i, \cdot)f(\cdot, j)/N. \tag{4.7.4}$$

The deviations of the observed frequencies from the expected are tested for randomness
by

$$X^2 = \sum_{i=1}^{m_2}\sum_{j=1}^{m_1}\frac{(f(i, j) - e(i, j))^2}{e(i, j)}. \tag{4.7.5}$$

Simple algebraic manipulations yield the statistic

$$X^2 = N\sum_{i=1}^{m_2}\sum_{j=1}^{m_1}\frac{f^2(i, j)}{f(i, \cdot)f(\cdot, j)} - N. \tag{4.7.6}$$

We test the hypothesis of no association by comparing $X^2$ to the $(1 - \alpha)$th quantile of
$\chi^2[\nu]$ with $\nu = (m_1 - 1)(m_2 - 1)$ degrees of freedom. We say that the association is
significant if $X^2 \ge \chi^2_{1-\alpha}[\nu]$. This is a large sample test. In small samples it may be
invalid. There are appropriate test procedures for small samples, especially for $2 \times 2$
tables. For further details, see Lancaster (1969, Chapters XI, XII).
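
The equivalence of the two forms (4.7.5) and (4.7.6) of the statistic can be illustrated with a short sketch. The function name `chi_square_association` and the frequencies are hypothetical; the sketch assumes that all marginal frequencies are positive.

```python
def chi_square_association(f):
    """Return X^2 computed by (4.7.5) and by (4.7.6), and the degrees of freedom."""
    rows = [sum(row) for row in f]          # marginal frequencies f(i, .)
    cols = [sum(col) for col in zip(*f)]    # marginal frequencies f(., j)
    n = sum(rows)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(cols))]
    # Definition (4.7.5): squared deviations from the estimated expected frequencies.
    x2_def = sum(
        (f[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i, j in cells
    )
    # Computational form (4.7.6).
    x2_short = n * sum(f[i][j] ** 2 / (rows[i] * cols[j]) for i, j in cells) - n
    df = (len(rows) - 1) * (len(cols) - 1)
    return x2_def, x2_short, df


print(chi_square_association([[10, 20], [30, 40]]))
```

The value of $X^2$ is then compared with the quantile $\chi^2_{1-\alpha}[\nu]$, taken from tables or from a statistical library.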


4.7.3 The Analysis of 2 x 2 Tables 


Consider the following $2 \times 2$ table of cell probabilities

R \ S       F            P            $\Sigma$
W           $P(1, 1)$    $P(1, 2)$    $P(1, \cdot)$
NW          $P(2, 1)$    $P(2, 2)$    $P(2, \cdot)$
$\Sigma$    $P(\cdot, 1)$  $P(\cdot, 2)$  1

S and R are two variables (success in a course and race, for example). The odds
ratio of F/P for W is defined as $P(1, 1)/P(1, 2)$ and for NW it is $P(2, 1)/P(2, 2)$.

These odds ratios are also called the relative risks. We say that there is no interaction
between the two variables if the odds ratios are the same. Define the cross
product ratio

$$\rho = \frac{P(1, 1)}{P(1, 2)}\cdot\frac{P(2, 2)}{P(2, 1)} = \frac{P(1, 1)P(2, 2)}{P(1, 2)P(2, 1)}. \tag{4.7.7}$$

If $\rho = 1$ there is no interaction; otherwise, the interaction is negative or positive
according to whether $\rho < 1$ or $\rho > 1$, respectively. Alternatively, we can measure
the interaction by

$$\omega = \log\rho = \log P(1, 1) - \log P(1, 2) - \log P(2, 1) + \log P(2, 2). \tag{4.7.8}$$


We develop now a test of the significance of the interaction, which is valid for any
sample size and is a uniformly most powerful test among the unbiased tests.

Consider first the conditional joint distribution of $X = f(1, 1)$ and $Y = f(2, 1)$
given the marginal frequency $T = f(1, 1) + f(1, 2)$. It is easy to prove that conditional
on $T$, $X$ and $Y$ are independent and have conditional binomial distributions
$B(T, P(1, 1)/P(1, \cdot))$ and $B(N - T, P(2, 1)/P(2, \cdot))$, respectively. We consider now
the conditional distribution of $X$ given the marginal frequencies $T = f(1, \cdot)$ and
$S = f(1, 1) + f(2, 1) = f(\cdot, 1)$. This conditional distribution has the p.d.f.

$$P[X = x \mid T = t, S = s] = \frac{\binom{t}{x}\binom{N - t}{s - x}\rho^x}{\sum_{j=0}^{t \wedge s}\binom{t}{j}\binom{N - t}{s - j}\rho^j}, \tag{4.7.9}$$

where $t \wedge s = \min(t, s)$ and $\rho$ is the interaction parameter given by (4.7.7). The
hypothesis of no interaction is equivalent to $H_0: \rho = 1$. Notice that for $\rho = 1$ the
p.d.f. (4.7.9) is reduced to that of the hypergeometric distribution $H(N, T, S)$. We
compare the observed value of $X$ to the $\alpha/2$- and $(1 - \alpha/2)$-quantiles of the hypergeometric
distribution, as in the case of comparing two binomial experiments. For a
generalization to $2^n$ contingency tables, see Zelen (1972).
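
The conditional p.d.f. (4.7.9) is easy to evaluate directly. The following sketch is an illustration only; the function name `conditional_pmf` and the numerical values are hypothetical. It relies on the convention that a binomial coefficient $\binom{a}{b}$ is zero when $b > a$, which Python's `math.comb` follows.

```python
from math import comb


def conditional_pmf(x, t, s, n, rho):
    """P[X = x | T = t, S = s] of (4.7.9) for a 2 x 2 table with N = n units."""
    def weight(j):
        # comb(a, b) is 0 when b > a, so impossible tables get weight zero.
        return comb(t, j) * comb(n - t, s - j) * rho ** j

    upper = min(t, s)
    if not 0 <= x <= upper:
        return 0.0
    return weight(x) / sum(weight(j) for j in range(upper + 1))


t, s, n = 5, 4, 12
# Under rho = 1 the p.d.f. coincides with the hypergeometric one.
print([conditional_pmf(x, t, s, n, 1.0) for x in range(min(t, s) + 1)])
print([comb(t, x) * comb(n - t, s - x) / comb(n, s) for x in range(min(t, s) + 1)])
```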


4.7.4 Likelihood Ratio Tests for Categorical Data 


Consider a two-way layout contingency table with $m_1$ levels of factor A and $m_2$
levels of factor B. The sample is of size $N$. The likelihood function of the vector $P$
of $s = m_1 \times m_2$ cell probabilities, $P(i, j)$, is

$$L(P; N, f) = \prod_{i=1}^{m_1}\prod_{j=1}^{m_2}(P(i, j))^{f(i, j)}, \tag{4.7.10}$$

where $f(i, j)$ are the cell frequencies. The hypothesis of no association, $H_0$, imposes
the restrictions on the cell probabilities

$$P(i, j) = P(i, \cdot)P(\cdot, j), \quad \text{for all } (i, j). \tag{4.7.11}$$

Thus, $\Theta_0$ is the parameter space restricted by (4.7.11), while $\Theta$ is the whole space of
$P$. Thus, the likelihood ratio statistic is

$$\Lambda(f, N) = \frac{\sup_{\Theta_0}\prod_{i=1}^{m_1}\prod_{j=1}^{m_2}[P(i, \cdot)P(\cdot, j)]^{f(i, j)}}{\sup_{\Theta}\prod_{i=1}^{m_1}\prod_{j=1}^{m_2}[P(i, j)]^{f(i, j)}}. \tag{4.7.12}$$

By taking the logarithm of the numerator and imposing the constraints that

$$\sum_{i=1}^{m_1}P(i, \cdot) = 1, \qquad \sum_{j=1}^{m_2}P(\cdot, j) = 1,$$

we obtain by the usual methods that the values that maximize it are

$$\begin{aligned} \hat P(i, \cdot) &= f(i, \cdot)/N, \quad i = 1, \ldots, m_1, \\ \hat P(\cdot, j) &= f(\cdot, j)/N, \quad j = 1, \ldots, m_2. \end{aligned} \tag{4.7.13}$$

Similarly, the denominator is maximized by substituting for $P(i, j)$ the sample estimate

$$\hat P(i, j) = f(i, j)/N, \quad i = 1, \ldots, m_1, \; j = 1, \ldots, m_2. \tag{4.7.14}$$

We thus obtain the likelihood ratio statistic

$$\Lambda(f; N) = \prod_{i=1}^{m_1}\prod_{j=1}^{m_2}\left(\frac{f(i, \cdot)f(\cdot, j)}{N f(i, j)}\right)^{f(i, j)}. \tag{4.7.15}$$

Equivalently, we can consider the test statistic $-\log\Lambda(f; N)$, which is

$$\Lambda^* = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2} f(i, j)\log\frac{N f(i, j)}{f(i, \cdot)f(\cdot, j)}. \tag{4.7.16}$$

Notice that $\Lambda^*$ is the empirical Kullback-Leibler information number to discriminate
between the actual frequency distribution $f(i, j)/N$ and the one corresponding to the
null hypothesis $f(i, \cdot)f(\cdot, j)/N^2$. This information discrimination statistic is different
from the $X^2$ statistic given in (4.7.6). In large samples, $2\Lambda^*$ has the same asymptotic
$\chi^2[\nu]$ distribution with $\nu = (m_1 - 1)(m_2 - 1)$. In small samples, however, it performs
differently.
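
A minimal sketch of the computation of $\Lambda^*$ in (4.7.16) follows. The function name `lr_statistic` and the frequencies are hypothetical. As an assumption of the sketch, a cell with zero frequency is taken to contribute zero to the sum, which is the limiting value of $f\log f$ as $f \to 0$.

```python
from math import log


def lr_statistic(f):
    """Lambda* = -log Lambda(f; N) of (4.7.16) for a two-way table of frequencies."""
    rows = [sum(row) for row in f]          # f(i, .)
    cols = [sum(col) for col in zip(*f)]    # f(., j)
    n = sum(rows)
    lam_star = 0.0
    for i, row in enumerate(f):
        for j, fij in enumerate(row):
            if fij > 0:  # an empty cell contributes nothing
                lam_star += fij * log(n * fij / (rows[i] * cols[j]))
    return lam_star


print(2 * lr_statistic([[10, 20], [30, 40]]))  # to be compared with chi-squared[1] quantiles
print(lr_statistic([[10, 20], [20, 40]]))      # a table with exactly proportional rows
```

For the first table, $2\Lambda^*$ is close to, but not equal to, the value of $X^2$ obtained from (4.7.6), in line with the remark that the two statistics differ in finite samples.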

For further reading and extensive bibliography on the theory and methods of 
contingency tables analysis, see Haberman (1974), Bishop, Fienberg, and Holland 
(1975), Fienberg (1980), and Agresti (1990). For the analysis of contingency tables 
from the point of view of information theory, see Kullback (1959, Chapter 8) and 
Gokhale and Kullback (1978). 


4.8 SEQUENTIAL TESTING OF HYPOTHESES 


Testing of hypotheses may become more efficient if we can perform the sampling 
in a sequential manner. After each observation (group of observations) we evaluate 
the results obtained so far and decide whether to terminate sampling and accept (or 
reject) the hypothesis $H_0$, or whether to continue sampling and observe an additional
(group of) observation(s). The main problem of sequential analysis then is
to determine the “best” stopping rule. After sampling terminates, the test function 
applied is generally of the generalized likelihood ratio type, with critical levels asso- 
ciated with the stopping rule, as will be described in the sequel. Early attempts to 
derive sequential testing procedures can be found in the literature on statistical qual- 
ity control (sampling inspection schemes) of the early 1930s. The formulation of 
the general theory was given by Wald (1945). Wald’s book on sequential analysis 
(1947) is the first important monograph on the subject. The method developed by 
Wald is called the Wald Sequential Probability Ratio Test (SPRT). Many papers 
have been written on the subject since Wald’s original work. The reader is referred to
the book of Ghosh (1970) for discussion of the important issues and the significant
results, as well as notes on the historical development and important references. See 
also Siegmund (1985). We provide in Section 4.8.1 a brief exposition of the basic 
theory of the Wald SPRT for testing two simple hypotheses. Some remarks are given 
about extension for testing composite hypotheses and about more recent develop- 
ment in the literature. In Section 4.8.2, we discuss sequential tests that can achieve 
power one. 


4.8.1 The Wald Sequential Probability Ratio Test 


Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables. Consider two simple
hypotheses $H_0$ and $H_1$, according to which the p.d.f.s of these random variables
are $f_0(x)$ or $f_1(x)$, respectively. Let $R(X_i) = f_1(X_i)/f_0(X_i)$, $i = 1, 2, \ldots$ be the
likelihood ratio statistics. The SPRT is specified by two boundary points $A$, $B$,
$-\infty < A < 0 < B < \infty$ and the stopping rule, according to which sampling continues
as long as the partial sums $S_n = \sum_{i=1}^{n}\log R(X_i)$, $n = 1, 2, \ldots$, lie between $A$ and
$B$. As soon as $S_n \le A$ or $S_n \ge B$, sampling terminates. In the first case, $H_0$ is accepted
and in the second case, $H_1$ is accepted. The sample size $N$ is a random variable that
depends on the past observations. More precisely, the event $\{N \le n\}$ depends on
$\{X_1, \ldots, X_n\}$ but is independent of $\{X_{n+1}, X_{n+2}, \ldots\}$ for all $n = 1, 2, \ldots$. Such a
nonnegative integer random variable is called a stopping variable. Let $\mathcal{B}_n$ denote the
$\sigma$-field generated by the random variables $Z_i = \log R(X_i)$, $i = 1, \ldots, n$. A stopping
variable $N$ defined with respect to $Z_1, Z_2, \ldots$ is an integer valued random variable $N$,
$N \ge 1$, such that the event $\{N \ge n\}$ is determined by $Z_1, \ldots, Z_{n-1}$ $(n \ge 2)$. In this
case, we say that $\{N \ge n\} \in \mathcal{B}_{n-1}$ and $I\{N \ge n\}$ is $\mathcal{B}_{n-1}$ measurable. We will show
that for any pair $(A, B)$, the stopping variable $N$ is finite with probability one. Such
a stopping variable is called regular. We will see then how to choose the boundaries
$(A, B)$ so that the error probabilities $\alpha$ and $\beta$ will be under control. Finally, formulae
for the expected sample size will be derived and some optimal properties will be
discussed.

In order to prove that the stopping variable $N$ is finite with probability one, we
have to prove that

$$\lim_{n\to\infty} P_\theta\{N > n\} = 0, \quad \text{for } \theta = 0, 1. \tag{4.8.1}$$

Equivalently, for a fixed integer $r$ (as large as we wish)

$$\lim_{m\to\infty} P_\theta\{N > mr\} = 0, \quad \text{for } \theta = 0, 1. \tag{4.8.2}$$

For $\theta = 0$ or 1, let

$$\mu(\theta) = E_\theta\{\log R(X)\}, \tag{4.8.3}$$

and

$$D^2(\theta) = \mathrm{Var}_\theta\{\log R(X)\}. \tag{4.8.4}$$

Assume that

$$0 < D^2(\theta) < \infty, \quad \text{for } \theta = 0, 1. \tag{4.8.5}$$

If $D^2(\theta) = 0$ for some $\theta$, then (4.8.1) holds trivially at that $\theta$.
Thus, for any value of $\theta$, the distribution of $S_n = \sum_{i=1}^{n}\log R(X_i)$ is asymptotically
normal. Moreover, for each $m = 1, 2, \ldots$ and a fixed integer $r$,

$$P_\theta[N > mr] \le P_\theta[A < S_r < B, |S_{2r} - S_r| < C, \ldots, |S_{mr} - S_{r(m-1)}| < C], \tag{4.8.6}$$

where $C = |B - A|$.
The variables $S_r, S_{2r} - S_r, \ldots, S_{mr} - S_{(m-1)r}$ are independent and identically distributed.
Moreover, by the Central Limit Theorem, if $r$ is sufficiently large,

$$P_\theta\{|S_r| > C\} \approx 2 - \Phi\left(\frac{C}{\sqrt{r}\,D(\theta)} + \sqrt{r}\,\frac{\mu(\theta)}{D(\theta)}\right) - \Phi\left(\frac{C}{\sqrt{r}\,D(\theta)} - \sqrt{r}\,\frac{\mu(\theta)}{D(\theta)}\right). \tag{4.8.7}$$

The RHS of (4.8.7) approaches 1 as $r \to \infty$. Accordingly, for any $\rho$, $0 < \rho < 1$, if $r$
is sufficiently large, then $P_\theta[|S_r| < C] < \rho$. Finally, since $S_{jr} - S_{(j-1)r}$ is distributed
like $S_r$ for all $j = 1, 2, \ldots, m$, if $r$ is sufficiently large, then

$$P_\theta[N > mr] \le P_\theta[A < S_r < B]\,\rho^{m-1}. \tag{4.8.8}$$

This shows that $P_\theta[N > n]$ converges to zero at an exponential rate. This property is
called the exponential boundedness of the stopping variables (Wijsman, 1971). We
prove now a very important result in sequential analysis, which is not restricted only
to SPRTs.


Theorem 4.8.1 (Wald Theorem). If $N$ is a regular stopping variable with finite
expectation $E_\theta\{N\}$, and if $X_1, X_2, \ldots$ is a sequence of i.i.d. random variables such
that $E_\theta\{|X_1|\} < \infty$, then

$$E_\theta\left\{\sum_{i=1}^{N} X_i\right\} = \xi(\theta)E_\theta\{N\}, \tag{4.8.9}$$

where

$$\xi(\theta) = E_\theta\{X_1\}.$$

Proof. Without loss of generality, assume that $X_1, X_2, \ldots$ is a sequence of i.i.d.
absolutely continuous random variables. Then,

$$E_\theta\left\{\sum_{i=1}^{N} X_i\right\} = \sum_{n=1}^{\infty}\int I\{N = n\}\sum_{j=1}^{n} x_j\, f(\mathbf{x}_n; \theta)\,d\mathbf{x}_n, \tag{4.8.10}$$

where $f(\mathbf{x}_n; \theta)$ is the joint p.d.f. of $\mathbf{X}_n = (X_1, \ldots, X_n)$. The integral in (4.8.10) is
actually an $n$-tuple integral. Since $E_\theta\{|X_1|\} < \infty$, we can interchange the order of
summation and integration and obtain

$$\begin{aligned} E_\theta\left\{\sum_{i=1}^{N} X_i\right\} &= \sum_{j=1}^{\infty}\sum_{n=j}^{\infty}\int x_j I\{N = n\} f(\mathbf{x}_n; \theta)\,d\mathbf{x}_n \\ &= \sum_{j=1}^{\infty} E_\theta\{X_j I\{N \ge j\}\}. \end{aligned} \tag{4.8.11}$$

However, the event $\{N \ge j\}$ is determined by $(X_1, \ldots, X_{j-1})$ and is therefore independent
of $X_j, X_{j+1}, \ldots$. Therefore, due to the independence of the $X$s,

$$\begin{aligned} E_\theta\left\{\sum_{i=1}^{N} X_i\right\} &= \sum_{j=1}^{\infty} E_\theta\{X_j\}P_\theta\{N \ge j\} \\ &= \xi(\theta)\sum_{j=1}^{\infty} P_\theta\{N \ge j\}. \end{aligned} \tag{4.8.12}$$

Finally, since $N$ is a positive integer random variable with finite expectation,

$$E_\theta\{N\} = \sum_{j=1}^{\infty} P_\theta\{N \ge j\}. \tag{4.8.13}$$

QED


From assumption (4.8.5) and the result (4.8.8), both $\mu(\theta)$ and $E_\theta\{N\}$ exist (finite).
Hence, for any SPRT, $E_\theta\{S_N\} = \mu(\theta)E_\theta\{N\}$. Let $\pi(\theta)$ denote the probability of
accepting $H_0$. Thus, if $\mu(\theta) \ne 0$,

$$E_\theta\{N\} = \frac{1}{\mu(\theta)}\left[\pi(\theta)E_\theta\{S_N \mid S_N \le A\} + (1 - \pi(\theta))E_\theta\{S_N \mid S_N \ge B\}\right]. \tag{4.8.14}$$

An approximation to $E_\theta\{N\}$ can then be obtained by substituting $A$ for $E_\theta\{S_N \mid
S_N \le A\}$ and $B$ for $E_\theta\{S_N \mid S_N \ge B\}$. This approximation neglects the excess over
the boundaries by $S_N$. One obtains

$$E_\theta\{N\} \approx \frac{1}{\mu(\theta)}\{\pi(\theta)A + (1 - \pi(\theta))B\}. \tag{4.8.15}$$

Error formulae for (4.8.15) can be found in the literature (Ghosh, 1970).
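
The identity $E_\theta\{S_N\} = \mu(\theta)E_\theta\{N\}$ can be verified numerically in a simple case. The sketch below is an illustration under assumptions that are not part of the text: the observations are Bernoulli, the two simple hypotheses are $p_0 = 0.5$ and $p_1 = 0.75$, the boundaries are $\pm\log 19$, and the enumeration is truncated at `max_n` steps (the probability of sampling beyond that point is negligible by the exponential boundedness property). The function name `sprt_exact_moments` is hypothetical.

```python
from math import log


def sprt_exact_moments(p, p0, p1, lower, upper, max_n=2000):
    """P(accept H0), E{N} and E{S_N} for an SPRT on Bernoulli(p) data, by enumeration."""
    up, down = log(p1 / p0), log((1 - p1) / (1 - p0))  # log R(x) for x = 1 and x = 0
    alive = {0: 1.0}  # number of successes so far -> probability of still sampling
    accept0 = exp_n = exp_s = 0.0
    for n in range(1, max_n + 1):
        nxt = {}
        for k, prob in alive.items():
            for k_new, w in ((k + 1, prob * p), (k, prob * (1 - p))):
                s = k_new * up + (n - k_new) * down  # S_n depends only on (n, k_new)
                if s <= lower or s >= upper:
                    exp_n += n * w
                    exp_s += s * w
                    if s <= lower:
                        accept0 += w
                else:
                    nxt[k_new] = nxt.get(k_new, 0.0) + w
        alive = nxt
        if not alive:
            break
    return accept0, exp_n, exp_s


p0, p1 = 0.5, 0.75
lower, upper = log(0.05 / 0.95), log(0.95 / 0.05)
accept0, exp_n, exp_s = sprt_exact_moments(p0, p0, p1, lower, upper)
mu = p0 * log(p1 / p0) + (1 - p0) * log((1 - p1) / (1 - p0))
print(accept0, exp_n)
print(exp_s, mu * exp_n)  # the two sides of Wald's equation
```

The exact value of $E_\theta\{N\}$ obtained this way can be compared with the approximation (4.8.15); the difference reflects the excess of $S_N$ over the boundaries.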
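Let $\alpha$ and $\beta$ be the error probabilities associated with the boundaries $A$, $B$ and let
$A' = \log\frac{\beta}{1 - \alpha}$, $B' = \log\frac{1 - \beta}{\alpha}$. Let $\alpha'$ and $\beta'$ be the error probabilities associated
with the boundaries $A'$, $B'$.

Theorem 4.8.2. If $0 < \alpha + \beta < 1$ then

(i) $\alpha' + \beta' \le \alpha + \beta$

and

(ii) $A' \le A$, $B' \ge B$.

Proof. For each $n = 1, 2, \ldots$ define the sets

$$\begin{aligned} A_n &= \{\mathbf{x}_n; A' < S_1 < B', \ldots, A' < S_{n-1} < B', S_n \le A'\}, \\ R_n &= \{\mathbf{x}_n; A' < S_1 < B', \ldots, A' < S_{n-1} < B', S_n \ge B'\}, \\ C_n &= \{\mathbf{x}_n; A' < S_1 < B', \ldots, A' < S_{n-1} < B', A' < S_n < B'\}. \end{aligned}$$

The error probability $\alpha'$ satisfies the inequality

$$\begin{aligned} \alpha' &= \sum_{n=1}^{\infty}\int_{R_n}\prod_{j=1}^{n} f_0(x_j)\,d\mu(x_j) \\ &\le \frac{\alpha}{1 - \beta}\sum_{n=1}^{\infty}\int_{R_n}\prod_{j=1}^{n} f_1(x_j)\,d\mu(x_j) = \frac{\alpha}{1 - \beta}(1 - \beta'). \end{aligned} \tag{4.8.16}$$

Similarly,

$$\begin{aligned} \beta' &= \sum_{n=1}^{\infty}\int_{A_n}\prod_{j=1}^{n} f_1(x_j)\,d\mu(x_j) \\ &\le \frac{\beta}{1 - \alpha}\sum_{n=1}^{\infty}\int_{A_n}\prod_{j=1}^{n} f_0(x_j)\,d\mu(x_j) = \frac{\beta}{1 - \alpha}(1 - \alpha'). \end{aligned} \tag{4.8.17}$$

Thus,

$$\frac{\alpha'}{1 - \beta'} \le \frac{\alpha}{1 - \beta}, \qquad \frac{\beta'}{1 - \alpha'} \le \frac{\beta}{1 - \alpha}. \tag{4.8.18}$$

From these inequalities we obtain the first statement of the theorem. Indeed, (4.8.18)
gives $\alpha'(1 - \beta) \le \alpha(1 - \beta')$ and $\beta'(1 - \alpha) \le \beta(1 - \alpha')$; adding these two inequalities
and cancelling the common terms $\alpha'\beta + \alpha\beta'$ yields $\alpha' + \beta' \le \alpha + \beta$. To establish (ii)
notice that if $R_n = \{\mathbf{x}: A < S_i < B, i = 1, \ldots, n - 1, S_n \ge B\}$, then

$$\begin{aligned} 1 - \beta &= \sum_{n=1}^{\infty}\int_{R_n}\prod_{j=1}^{n} f_1(x_j)\,d\mu(x_j) \ge e^{B}\sum_{n=1}^{\infty}\int_{R_n}\prod_{j=1}^{n} f_0(x_j)\,d\mu(x_j) \\ &= \alpha e^{B}. \end{aligned} \tag{4.8.19}$$

Hence, $B' = \log\frac{1 - \beta}{\alpha} \ge B$. The other inequality is proven similarly. QED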


It is generally difficult to determine the values of $A$ and $B$ to obtain the specified
error probabilities $\alpha$ and $\beta$. However, according to the theorem, if $\alpha$ and $\beta$ are small
then, by considering the boundaries $A'$ and $B'$, we obtain a procedure with error
probabilities $\alpha'$ and $\beta'$ close to the specified ones and total test size $\alpha' + \beta'$ smaller
than $\alpha + \beta$. For this reason $A'$ and $B'$ are generally used in applications. We derive
now an approximation to the acceptance probability $\pi(\theta)$. This approximation is
based on the following important identity.
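
The stopping rule with the boundaries $A'$ and $B'$ can be sketched as follows. This is an illustration only: the function names `wald_boundaries`, `sprt`, and `bernoulli_log_ratio` are hypothetical, and the Bernoulli model with $p_0 = 0.5$ against $p_1 = 0.75$ is an assumption made for the example.

```python
from math import log


def wald_boundaries(alpha, beta):
    """Boundaries A' = log(beta / (1 - alpha)) and B' = log((1 - beta) / alpha)."""
    return log(beta / (1 - alpha)), log((1 - beta) / alpha)


def bernoulli_log_ratio(p0, p1):
    """log R(x) = log f1(x) / f0(x) for a Bernoulli observation x in {0, 1}."""
    return lambda x: log(p1 / p0) if x == 1 else log((1 - p1) / (1 - p0))


def sprt(observations, log_ratio, lower, upper):
    """Apply the stopping rule; return the decision and the number of observations used."""
    s = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        s += log_ratio(x)  # partial sum S_n
        if s <= lower:
            return "accept H0", n
        if s >= upper:
            return "accept H1", n
    return "continue", n   # the data ran out before a boundary was crossed


lower, upper = wald_boundaries(0.05, 0.05)
z = bernoulli_log_ratio(0.5, 0.75)
print(sprt([1] * 20, z, lower, upper))
print(sprt([0] * 20, z, lower, upper))
```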


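Theorem 4.8.3 (Wald Fundamental Identity). Let $N$ be a stopping variable associated
with the Wald SPRT and $M_\theta(t)$ be the moment generating function (m.g.f.) of
$Z = \log R(X)$. Then

$$E_\theta\{e^{tS_N}(M_\theta(t))^{-N}\} = 1 \tag{4.8.20}$$

for all $t$ for which $M_\theta(t)$ exists.

Proof.

$$\begin{aligned} E_\theta\{e^{tS_N}(M_\theta(t))^{-N}\} &= \sum_{n=1}^{\infty} E_\theta\{I\{N = n\}e^{tS_n}(M_\theta(t))^{-n}\} \\ &= \lim_{m\to\infty}\sum_{n=1}^{m} E_\theta\{(I\{N \ge n\} - I\{N \ge n + 1\})e^{tS_n}(M_\theta(t))^{-n}\}. \end{aligned} \tag{4.8.21}$$

Notice that $I\{N \ge n\}$ is $\mathcal{B}_{n-1}$ measurable and therefore, for $n \ge 2$,

$$E_\theta\{I\{N \ge n\}e^{tS_n}(M_\theta(t))^{-n}\} = E_\theta\{I\{N \ge n\}e^{tS_{n-1}}(M_\theta(t))^{-(n-1)}\} \tag{4.8.22}$$

and $E_\theta\{I\{N \ge 1\}e^{tS_1}(M_\theta(t))^{-1}\} = 1$. Substituting these in (4.8.21), we obtain

$$E_\theta\{e^{tS_N}(M_\theta(t))^{-N}\} = 1 - \lim_{m\to\infty} E_\theta\{I\{N \ge m + 1\}e^{tS_m}(M_\theta(t))^{-m}\}.$$

Notice that $E_\theta\{e^{tS_m}(M_\theta(t))^{-m}\} = 1$ for all $m = 1, 2, \ldots$ and all $t$ in the domain of
convergence of $M_\theta(t)$. Thus, $\{e^{tS_m}(M_\theta(t))^{-m}, m \ge 1\}$ is uniformly integrable. Finally,
since $\lim_{m\to\infty} P\{N > m\} = 0$,

$$\lim_{m\to\infty} E_\theta\{I\{N \ge m + 1\}e^{tS_m}(M_\theta(t))^{-m}\} = 0.$$

QED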
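Choose $\epsilon > 0$ so that,

$$P_1 = P_\theta\{Z > \epsilon\} > 0 \quad \text{and} \quad P_2 = P_\theta\{Z < -\epsilon\} > 0. \tag{4.8.23}$$

Then for $t > 0$, $M_\theta(t) = E_\theta\{e^{tZ}\} \ge P_1 e^{t\epsilon}$. Similarly, for $t < 0$, $M_\theta(t) \ge P_2 e^{-t\epsilon}$. This
proves that $\lim_{|t|\to\infty} M_\theta(t) = \infty$. Moreover, for all $t$ for which $M_\theta(t)$ exists,

$$\begin{aligned} \frac{d}{dt}M_\theta(t) &= E_\theta\{Ze^{tZ}\}, \\ \frac{d^2}{dt^2}M_\theta(t) &= E_\theta\{Z^2e^{tZ}\} > 0. \end{aligned} \tag{4.8.24}$$

Thus, we deduce that the m.g.f. $M_\theta(t)$ is a strictly convex function of $t$. The expectation
$\mu(\theta)$ is $M_\theta'(0)$. Hence, if $\mu(\theta) > 0$ then $M_\theta(t)$ attains its unique minimum at a negative
value $t^*$ and $M_\theta(t^*) < 1$. Furthermore, there exists a value $t_0$, $-\infty < t_0 < t^* < 0$,
at which $M_\theta(t_0) = 1$. Similarly, if $\mu(\theta) < 0$, there exist positive values $t^*$ and $t^0$,
$0 < t^* < t^0 < \infty$, such that $M_\theta(t^*) < 1$ and $M_\theta(t^0) = 1$. In both cases $t^*$ and $t_0$
(respectively $t^0$) are unique.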
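
The nonzero root of $M_\theta(t) = 1$ can be located numerically by bisection, using the convexity argument above. The following is a sketch under assumptions that are not part of the text: the observations are Bernoulli with success probability $p$, the two simple hypotheses are $p_0$ and $p_1$, and the root is assumed not to lie within $10^{-6}$ of the origin. The function names `bernoulli_mgf` and `nonzero_root` are hypothetical.

```python
from math import log


def bernoulli_mgf(t, p, p0, p1):
    """M_theta(t), the m.g.f. of Z = log R(X), when X is Bernoulli(p)."""
    return p * (p1 / p0) ** t + (1 - p) * ((1 - p1) / (1 - p0)) ** t


def nonzero_root(mgf, mu, iterations=200):
    """Nonzero solution of mgf(t) = 1; its sign is opposite to that of mu(theta)."""
    if mu == 0:
        raise ValueError("no nonzero root when mu(theta) = 0")
    sign = 1.0 if mu < 0 else -1.0
    lo, hi = 1e-6 * sign, sign      # mgf(lo) < 1 close to the origin
    while mgf(hi) <= 1.0:           # move hi outward until mgf(hi) > 1
        hi *= 2.0
    for _ in range(iterations):     # bisection between lo and hi
        mid = 0.5 * (lo + hi)
        if mgf(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p0, p1 = 0.5, 0.75
for p in (p0, p1):
    mu = p * log(p1 / p0) + (1 - p) * log((1 - p1) / (1 - p0))
    print(p, nonzero_root(lambda t: bernoulli_mgf(t, p, p0, p1), mu))
```

When the true distribution is $f_0$ the root is $t = 1$, since $E_0\{f_1(X)/f_0(X)\} = 1$; when it is $f_1$ the root is $t = -1$.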

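The fundamental identity can be applied to obtain an approximation for the acceptance
probability $\pi(\theta)$ of the SPRT with boundaries $A'$ and $B'$. According to the
fundamental identity

$$\pi(\theta)E_\theta\{e^{t_0(\theta)S_N} \mid S_N \le A'\} + (1 - \pi(\theta))E_\theta\{e^{t_0(\theta)S_N} \mid S_N \ge B'\} = 1, \tag{4.8.25}$$

where $t_0(\theta) \ne 0$ is the point at which $M_\theta(t) = 1$. The approximation for $\pi(\theta)$ is
obtained by substituting in (4.8.25)

$$E_\theta\{e^{t_0(\theta)S_N} \mid S_N \le A'\} \cong e^{t_0(\theta)A'} = \left(\frac{\beta}{1 - \alpha}\right)^{t_0(\theta)}$$

and

$$E_\theta\{e^{t_0(\theta)S_N} \mid S_N \ge B'\} \cong \left(\frac{1 - \beta}{\alpha}\right)^{t_0(\theta)}.$$

This approximation yields the formula

$$\pi(\theta) \cong \frac{\left(\frac{1 - \beta}{\alpha}\right)^{t_0(\theta)} - 1}{\left(\frac{1 - \beta}{\alpha}\right)^{t_0(\theta)} - \left(\frac{\beta}{1 - \alpha}\right)^{t_0(\theta)}} \tag{4.8.26}$$

for all $\theta$ such that $\mu(\theta) \ne 0$. If $\theta_0$ is such that $\mu(\theta_0) = 0$, then

$$\pi(\theta_0) \cong \frac{\log\frac{1 - \beta}{\alpha}}{\log\frac{1 - \beta}{\alpha} - \log\frac{\beta}{1 - \alpha}}. \tag{4.8.27}$$

The approximation for $E_\theta\{N\}$ given by (4.8.15) is inapplicable at $\theta_0$. However, at $\theta_0$,
Wald's Theorem yields the result

$$E_{\theta_0}\{S_N^2\} = E_{\theta_0}\{N\}E_{\theta_0}\{Z^2\}. \tag{4.8.28}$$

From this, we obtain for $\theta_0$

$$E_{\theta_0}\{N\} \cong \frac{\pi(\theta_0)\left(\log\frac{\beta}{1 - \alpha}\right)^2 + (1 - \pi(\theta_0))\left(\log\frac{1 - \beta}{\alpha}\right)^2}{E_{\theta_0}\{Z^2\}}. \tag{4.8.29}$$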
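
The approximations (4.8.26), (4.8.27), (4.8.15), and (4.8.29) are simple enough to code directly. The sketch below is an illustration with hypothetical function names (`oc_approx`, `asn_approx`, `asn_approx_zero_drift`); the root $t_0(\theta)$, the mean $\mu(\theta)$, and the second moment $E_{\theta_0}\{Z^2\}$ are assumed to be supplied by the user, and the boundaries are taken to be $A'$ and $B'$.

```python
from math import log


def oc_approx(t0, alpha, beta):
    """Approximate acceptance probability pi(theta): (4.8.26), or (4.8.27) when t0 = 0."""
    upper = (1 - beta) / alpha
    lower = beta / (1 - alpha)
    if t0 == 0:
        return log(upper) / (log(upper) - log(lower))
    return (upper ** t0 - 1) / (upper ** t0 - lower ** t0)


def asn_approx(pi, mu, alpha, beta):
    """Approximate E{N} by (4.8.15) with the boundaries A' and B'; requires mu != 0."""
    a, b = log(beta / (1 - alpha)), log((1 - beta) / alpha)
    return (pi * a + (1 - pi) * b) / mu


def asn_approx_zero_drift(pi, second_moment, alpha, beta):
    """Approximate E{N} by (4.8.29), where second_moment stands for E{Z^2}."""
    a, b = log(beta / (1 - alpha)), log((1 - beta) / alpha)
    return (pi * a ** 2 + (1 - pi) * b ** 2) / second_moment


alpha, beta = 0.05, 0.10
print(oc_approx(1.0, alpha, beta))    # root t0 = 1, i.e. the distribution under H0
print(oc_approx(-1.0, alpha, beta))   # root t0 = -1, i.e. the distribution under H1
print(oc_approx(0.0, alpha, beta))    # the case mu(theta) = 0
```

Substituting $t_0 = 1$ and $t_0 = -1$ in (4.8.26) gives exactly $1 - \alpha$ and $\beta$, which is a convenient check of the formula.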


In Example 4.17, we have illustrated the use of the Wald SPRT for testing two
composite hypotheses when the interval $\Theta_0$ corresponding to $H_0$ is separated from
the interval $\Theta_1$ of $H_1$. We obtained a test procedure with very desirable properties by
constructing the SPRT for two simple hypotheses, since the family $\mathcal{F}$ of distribution
functions under consideration is MLR. For such families we obtain a monotone $\pi(\theta)$
function, with acceptance probability greater than $1 - \alpha$ for all $\theta < \theta_0$ and $\pi(\theta) < \beta$
for all $\theta > \theta_1$ (Ghosh, 1970, pp. 100-103). The function $\pi(\theta)$ is called the operating
characteristic function O.C. of the SPRT. The expected sample size function $E_\theta\{N\}$
increases to a maximum between $\theta_0$ and $\theta_1$ and then decreases to zero again. At
$\theta = \theta_0$ and at $\theta = \theta_1$ the function $E_\theta\{N\}$ assumes the smallest values corresponding
to all possible test procedures with error probabilities not exceeding $\alpha$ and $\beta$. This is
the optimality property of the Wald SPRT. We state this property more precisely in
the following theorem.


Theorem 4.8.4 (Wald and Wolfowitz). Consider any SPRT for testing the two
simple hypotheses $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$ with boundary points $(A, B)$ and
error probabilities $\alpha$ and $\beta$. Let $E_{\theta_i}\{N\}$, $i = 0, 1$ be the expected sample size. If $s$
is any sampling procedure for testing $H_0$ against $H_1$ with error probabilities $\alpha(s)$
and $\beta(s)$ and finite expected sample size $E_{\theta_i}\{N(s)\}$ $(i = 0, 1)$, then $\alpha(s) \le \alpha$ and
$\beta(s) \le \beta$ imply that $E_{\theta_i}\{N\} \le E_{\theta_i}\{N(s)\}$, for $i = 0, 1$.


For the proof of this important theorem, see Ghosh (1970, pp. 93-98), Siegmund 
(1985, p. 19). See also Section 8.2.3. 

Although the Wald SPRT is optimal at $\theta_0$ and at $\theta_1$ in the above sense, if the
actual $\theta$ is between $\theta_0$ and $\theta_1$, even in the MLR case, the expected sample size
may be quite large. Several papers were written on this subject and more general
sequential procedures were investigated, in order to obtain procedures with error
probabilities not exceeding $\alpha$ and $\beta$ at $\theta_0$ and $\theta_1$ and expected sample size at $\theta_0 <
\theta < \theta_1$ smaller than that of the SPRT. Kiefer and Weiss (1957) studied the problem
of determining a sequential test that, subject to the above constraint on the error
probabilities, minimizes the maximal expected sample size. They have shown that
such a test is a generalized version of an SPRT. The same problem was studied
recently by Lai (1973) for normally distributed random variables using the theory
of optimal stopping rules. Lai developed a method of determining the boundaries
$\{(A_n, B_n), n \geq 1\}$ of the sequential test that minimizes the maximal expected sample
size. The theory required for discussing this method is beyond the scope of this
chapter. We remark in conclusion that many of the results of this section can be
obtained in a more elegant fashion by using the general theory of optimal stopping
rules. The reader is referred in particular to the book of Chow, Robbins, and Siegmund
(1971). For a comparison of the asymptotic relative efficiency of sequential and
nonsequential tests of composite hypotheses, see Berk (1973, 1975). A comparison
of the asymptotic properties of various sequential tests (on the means of normal
distributions), which combines both the type I error probability and the expected
sample size, has been provided by Berk (1976).

Example 4.1. A new drug is being considered for adoption at a medical center. It is
desirable that the probability of success in curing the disease under consideration will
be at least $\theta_0 = .75$. A random sample of $n = 30$ patients is subjected to a treatment
with the new drug. We assume that all the patients in the sample respond to the
treatment independently of each other and have the same probability to be cured,
$\theta$. That is, we adopt a Binomial model $B(30, \theta)$ for the number of successes in the
sample. The value $\theta_0 = .75$ is the boundary between undesirable and desirable cure
probabilities. We wish to test the hypothesis that $\theta \geq .75$.

If the number of successes is large the data support the hypothesis of a large $\theta$
value. The question is how small the observed value of $X$ could be before we should
reject the hypothesis that $\theta \geq .75$. If $X = 18$ and we reject the hypothesis then
$\alpha(18) = B(18; 30, .75) = .05066$. This level of significance is generally considered
sufficiently small and we reject the hypothesis if $X \leq 18$.
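
The level $\alpha(18)$ quoted above can be checked by summing the binomial p.d.f. directly. The following is a minimal sketch; the helper name `binom_cdf` is chosen here for illustration and is not part of the text.

```python
from math import comb

def binom_cdf(k, n, theta):
    """P{X <= k} for X ~ B(n, theta), by direct summation of the p.d.f."""
    return sum(comb(n, j) * theta**j * (1 - theta)**(n - j) for j in range(k + 1))

# Level attained by the rule "reject when X <= 18", evaluated at theta_0 = .75.
alpha_18 = binom_cdf(18, 30, 0.75)
print(round(alpha_18, 5))
```

The same helper evaluated at a smaller $\theta$ gives the probability of rejection at that $\theta$.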


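Example 4.2. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a common
rectangular distribution $R(0, \theta)$, $0 < \theta < \infty$. We wish to test the hypothesis $H_0:
\theta \leq \theta_0$ against the alternative $H_1: \theta > \theta_0$. An m.s.s. is the sample maximum $X_{(n)}$.
Hence, we construct a test function of size $\alpha$, for some given $\alpha$ in $(0, 1)$, which
depends on $X_{(n)}$. Obviously, if $X_{(n)} \geq \theta_0$ we should reject the null hypothesis. Thus,
it is reasonable to construct a test function $\phi(X_{(n)})$ that rejects $H_0$ whenever $X_{(n)} \geq C_\alpha$.
$C_\alpha$ depends on $\alpha$ and $\theta_0$, i.e.,
\[
\phi(X_{(n)}) = \begin{cases} 1, & \text{if } X_{(n)} \geq C_\alpha \\ 0, & \text{otherwise.} \end{cases}
\]
$C_\alpha$ is determined so that the size of the test will be $\alpha$. At $\theta = \theta_0$,
\[
P_{\theta_0}\{X_{(n)} \geq C_\alpha\} = \frac{n}{\theta_0^n} \int_{C_\alpha}^{\theta_0} t^{n-1}\, dt = 1 - \left(\frac{C_\alpha}{\theta_0}\right)^n.
\]
Hence, we set $C_\alpha = \theta_0 (1 - \alpha)^{1/n}$. The power function, for all $\theta > \theta_0$, is
\[
\psi(\theta) = P_\theta\{X_{(n)} \geq \theta_0 (1 - \alpha)^{1/n}\} = 1 - (1 - \alpha)\left(\frac{\theta_0}{\theta}\right)^n.
\]
We see that $\psi(\theta)$ is greater than $\alpha$ for all $\theta > \theta_0$. On the other hand, for $\theta \leq \theta_0$, the
probability of rejection is
\[
E_\theta\{\phi(X_{(n)})\} = 1 - \min\left\{1, \left(\frac{\theta_0}{\theta}\right)^n (1 - \alpha)\right\}.
\]
Accordingly, the maximal probability of rejection, when $H_0$ is true, is $\alpha$ and if $\theta < \theta_0$,
the probability of rejection is smaller than $\alpha$. Obviously, if $\theta \leq \theta_0 (1 - \alpha)^{1/n}$, then the
probability of rejection is zero.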
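
The critical level and the rejection probability of Example 4.2 can be evaluated numerically as follows. This is a small sketch; the function names are illustrative choices, and the single formula $1 - \min\{1, (C_\alpha/\theta)^n\}$ is used for every $\theta > 0$.

```python
def critical_value(theta0, n, alpha):
    """C_alpha = theta0 * (1 - alpha)**(1/n)."""
    return theta0 * (1 - alpha) ** (1 / n)

def rejection_probability(theta, theta0, n, alpha):
    """P_theta{X_(n) >= C_alpha} = 1 - min{1, (C_alpha / theta)**n}."""
    c = critical_value(theta0, n, alpha)
    return 1 - min(1.0, (c / theta) ** n)

# Size at the boundary, power above it, and zero rejection far below it.
for theta in (1.5, 2.0, 2.5, 4.0):
    print(theta, rejection_probability(theta, theta0=2.0, n=10, alpha=0.05))
```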


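Example 4.3. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a normal distribution
$N(\mu, \sigma^2)$. According to the null hypothesis $H_0: \mu = \mu_1$, $\sigma = \sigma_1$. According to the
alternative hypothesis $H_1: \mu = \mu_2$, $\sigma = \sigma_2$; $\sigma_2 > \sigma_1$. The likelihood ratio is
\[
\frac{f_1(\mathbf{x})}{f_0(\mathbf{x})} = \left(\frac{\sigma_1}{\sigma_2}\right)^n \exp\left\{-\frac{1}{2}\sum_{i=1}^n\left[\left(\frac{x_i - \mu_2}{\sigma_2}\right)^2 - \left(\frac{x_i - \mu_1}{\sigma_1}\right)^2\right]\right\}
\]
\[
= \left(\frac{\sigma_1}{\sigma_2}\right)^n \exp\left\{\frac{1}{2}\,\frac{\sigma_2^2 - \sigma_1^2}{\sigma_1^2\sigma_2^2}\sum_{i=1}^n\left(x_i - \frac{\sigma_1\mu_2 + \sigma_2\mu_1}{\sigma_1 + \sigma_2}\right)\left(x_i + \frac{\sigma_1\mu_2 - \sigma_2\mu_1}{\sigma_2 - \sigma_1}\right)\right\}.
\]
We notice that the distribution function of $f_1(\mathbf{X})/f_0(\mathbf{X})$ is continuous and therefore
$\gamma_\alpha = 0$. According to the Neyman-Pearson Lemma, a most powerful test of size
$\alpha$ is obtained by rejecting $H_0$ whenever $f_1(\mathbf{X})/f_0(\mathbf{X})$ is greater than some positive
constant $k_\alpha$. But, since $\sigma_2 > \sigma_1$, this is equivalent to the test function that rejects $H_0$
whenever
\[
\sum_{i=1}^n\left(X_i - \frac{\sigma_1\mu_2 + \sigma_2\mu_1}{\sigma_1 + \sigma_2}\right)\left(X_i + \frac{\sigma_1\mu_2 - \sigma_2\mu_1}{\sigma_2 - \sigma_1}\right) \geq C_\alpha,
\]
where $C_\alpha$ is an appropriate constant. Simple algebraic manipulations yield that $H_0$
should be rejected whenever
\[
\sum_{i=1}^n (X_i - \omega)^2 \geq C_\alpha^*,
\]
where
\[
\omega = (\sigma_2^2\mu_1 - \sigma_1^2\mu_2)/(\sigma_2^2 - \sigma_1^2).
\]
We find $C_\alpha^*$ in the following manner. According to $H_0$,
\[
X_i - \omega \sim N(\delta, \sigma_1^2)
\]
with $\delta = \sigma_1^2(\mu_2 - \mu_1)/(\sigma_2^2 - \sigma_1^2)$. It follows that $\sum_{i=1}^n (X_i - \omega)^2 \sim \sigma_1^2\chi^2[n; n\delta^2/2\sigma_1^2]$
and thus,
\[
C_\alpha^* = \sigma_1^2\,\chi^2_{1-\alpha}[n; n\delta^2/2\sigma_1^2],
\]
where $\chi^2_{1-\alpha}[n; \lambda]$ is the $(1 - \alpha)$th quantile of the noncentral $\chi^2[n; \lambda]$. We notice that if
$\mu_1 = \mu_2$ but $\sigma_1 \neq \sigma_2$, the two hypotheses reduce to hypotheses about the variance alone, with a
common known mean. In this case, $\delta = 0$ and $C_\alpha^* = \sigma_1^2\chi^2_{1-\alpha}[n]$. If
$\sigma_1 = \sigma_2$ but $\mu_2 > \mu_1$ (or $\mu_2 < \mu_1$), the test reduces to the t-test of Example 4.9.
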
Example 4.4. In this example, we present a case of testing the shape parameter
of a Weibull distribution. This case is important in reliability life testing. We show
that even if the problem is phrased in terms of two simple hypotheses, it is not a
simple matter to determine the most powerful test function. This difficulty is due
to the fact that if the shape parameter is unknown, the minimal sufficient statistic
is the order statistic. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common
Weibull distribution $G^{1/\nu}(1, 1)$. We wish to test the null hypothesis $H_0: \nu = 1$ against
the simple alternative $H_1: \nu = 1 + \delta$; $\delta$ is a specified positive number. Notice that
under $H_0$, $X_i \sim E(1)$, i.e., exponentially distributed with mean 1. According to the
Neyman-Pearson Lemma, the most powerful test of size $\alpha$ rejects $H_0$ whenever
\[
(1 + \delta)^n \left(\prod_{i=1}^n X_i\right)^{\delta} \exp\left\{-\sum_{i=1}^n \left(X_i^{1+\delta} - X_i\right)\right\} \geq k_\alpha,
\]
where $k_\alpha$ is determined so that if $H_0$ is correct, then the probability is exactly
$\alpha$. Equivalently, we have to determine a constant $C_\alpha$ so that, under $H_0$, the
probability of
\[
\sum_{i=1}^n \left[\log X_i - \frac{1}{\delta}\left(X_i^{1+\delta} - X_i\right)\right] \geq C_\alpha
\]
is exactly $\alpha$. Let $W_i(\delta) = \log X_i - \frac{1}{\delta}(X_i^{1+\delta} - X_i)$ and $S_n(\delta) = \sum_{i=1}^n W_i(\delta)$. The problem
is to determine the distribution of $S_n(\delta)$ under $H_0$. If $n$ is large, we can approximate
the distribution of $S_n(\delta)$ by a normal distribution. The expected value of $W(\delta)$ is
\[
\mu_1(\delta) = \Gamma'(1) + \frac{1}{\delta} - \frac{1}{\delta}\Gamma(2 + \delta),
\]
where $\Gamma'(1)$ is the derivative of $\Gamma(x)$ at $x = 1$. The second moment of $W(\delta)$ is
\[
\mu_2(\delta) = \frac{2}{\delta^2} + \Gamma''(1) + \frac{2}{\delta}\left(\Gamma'(2) - \Gamma'(2 + \delta)\right) + \frac{1}{\delta^2}\left(\Gamma(3 + 2\delta) - 2\Gamma(3 + \delta)\right).
\]
Thus, according to the Central Limit Theorem, if $n$ is sufficiently large, then
\[
\lim_{n\to\infty} P\left\{\frac{1}{\sqrt{n}}\left(S_n(\delta) - n\mu_1(\delta)\right) \leq x\left[\mu_2(\delta) - \mu_1^2(\delta)\right]^{1/2}\right\} = \Phi(x).
\]
Accordingly, for large values of $n$, the critical level $C_\alpha$ is approximately
\[
C_\alpha \approx n\mu_1(\delta) + z_{1-\alpha}\sqrt{n}\left\{\mu_2(\delta) - \mu_1^2(\delta)\right\}^{1/2}.
\]
For small values of $n$, we can determine $C_\alpha$ approximately by simulating many
replicas of $S_n(\delta)$ values when $X_1, \ldots, X_n$ are $E(1)$ and determining the $(1 - \alpha)$th
quantile point of the empirical distribution of $S_n(\delta)$.
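
The simulation approach for small $n$ might be sketched as follows. The function names, the number of replicas, and the seed are assumptions made here for illustration; the formula for $\mu_1(\delta)$ uses $\Gamma'(1) = -\gamma$, where $\gamma$ is Euler's constant.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Gamma'(1) = -EULER_GAMMA

def w_stat(x, delta):
    """W(delta) = log x - (x**(1 + delta) - x) / delta."""
    return math.log(x) - (x ** (1 + delta) - x) / delta

def mu1(delta):
    """E{W(delta)} under H0: Gamma'(1) + 1/delta - Gamma(2 + delta)/delta."""
    return -EULER_GAMMA + (1 - math.gamma(2 + delta)) / delta

def simulate_s(n, delta, reps, seed=0):
    """Simulated replicas of S_n(delta) when X_1, ..., X_n are i.i.d. E(1)."""
    rng = random.Random(seed)
    return [sum(w_stat(rng.expovariate(1.0), delta) for _ in range(n))
            for _ in range(reps)]

def critical_level(n, delta, alpha, reps=20000, seed=0):
    """Empirical (1 - alpha)th quantile of the simulated S_n(delta) values."""
    draws = sorted(simulate_s(n, delta, reps, seed))
    return draws[math.ceil((1 - alpha) * reps) - 1]

print(critical_level(n=5, delta=0.5, alpha=0.05))
```

Comparing the simulated mean of $S_n(\delta)$ with $n\mu_1(\delta)$ gives a quick consistency check of the moment formula.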


Example 4.5. Consider an experiment in which $n$ Bernoulli trials are performed. Let
$K$ denote the number of successes among these trials and let $\theta$ denote the probability
of success. Suppose that we wish to test the hypothesis
\[
H_0: \theta \leq \theta_0 \quad \text{against} \quad H_1: \theta > \theta_0,
\]
at level of significance $\alpha$. $\theta_0$ and $\alpha$ are specified numbers. The UMP (randomized)
test function is
\[
\phi(K) = \begin{cases} 1, & \text{if } K > \xi_\alpha(\theta_0) \\ \gamma_\alpha, & \text{if } K = \xi_\alpha(\theta_0) \\ 0, & \text{otherwise} \end{cases}
\]
where $\xi_\alpha(\theta_0)$ is the $(1 - \alpha)$-quantile of the binomial distribution $B(n, \theta_0)$, i.e.,
\[
\xi_\alpha(\theta_0) = \text{least nonnegative integer } k \text{ such that } \sum_{j=0}^{k} b(j; n, \theta_0) \geq 1 - \alpha.
\]
Furthermore,
\[
\gamma_\alpha = \frac{B(\xi_\alpha(\theta_0); n, \theta_0) - (1 - \alpha)}{b(\xi_\alpha(\theta_0); n, \theta_0)}.
\]
Accordingly, if the number of successes $K$ is larger than the $(1 - \alpha)$-quantile of
$B(n, \theta_0)$, we reject $H_0$. If $K$ equals $\xi_\alpha(\theta_0)$, the null hypothesis $H_0$ is rejected only
with probability $\gamma_\alpha$. That is, a random number $R$ having an $R(0, 1)$ distribution is
picked from a table of random numbers. If $K = \xi_\alpha(\theta_0)$ and $R \leq \gamma_\alpha$, $H_0$ is rejected;
if $K = \xi_\alpha(\theta_0)$ and $R > \gamma_\alpha$, then $H_0$ is accepted. If $K < \xi_\alpha(\theta_0)$, $H_0$ is accepted. It is
easy to verify that if $\theta = \theta_0$ then the probability of rejecting $H_0$ is exactly $\alpha$. If $\theta < \theta_0$
this probability is smaller than $\alpha$ and, on the other hand, if $\theta > \theta_0$ the probability
of rejection is greater than $\alpha$. The test of this one-sided hypothesis $H_0$ can be easily
performed with the aid of tables of the cumulative binomial distributions. The exact
power of the test can be determined according to the formula
\[
\psi(\theta) = 1 - B(\xi_\alpha(\theta_0); n, \theta) + \gamma_\alpha\, b(\xi_\alpha(\theta_0); n, \theta),
\]
where $\theta > \theta_0$. If the hypotheses are one-sided but to the other direction, i.e., $H_0:
\theta \geq \theta_0$ against $H_1: \theta < \theta_0$, the UMP test is similar.
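
The constants $\xi_\alpha(\theta_0)$ and $\gamma_\alpha$ and the exact power function of Example 4.5 can be computed without tables. The sketch below uses illustrative function names and a hypothetical choice of $n = 20$, $\theta_0 = .3$, $\alpha = .05$.

```python
from math import comb

def pmf(j, n, theta):
    """b(j; n, theta), the binomial p.d.f."""
    return comb(n, j) * theta**j * (1 - theta)**(n - j)

def cdf(k, n, theta):
    """B(k; n, theta), the binomial c.d.f."""
    return sum(pmf(j, n, theta) for j in range(k + 1))

def ump_constants(n, theta0, alpha):
    """Return (xi, gamma): the (1 - alpha)-quantile of B(n, theta0) and the
    randomization probability used when K equals that quantile."""
    xi = next(k for k in range(n + 1) if cdf(k, n, theta0) >= 1 - alpha)
    gamma = (cdf(xi, n, theta0) - (1 - alpha)) / pmf(xi, n, theta0)
    return xi, gamma

def power(theta, n, theta0, alpha):
    """psi(theta) = 1 - B(xi; n, theta) + gamma * b(xi; n, theta)."""
    xi, gamma = ump_constants(n, theta0, alpha)
    return 1 - cdf(xi, n, theta) + gamma * pmf(xi, n, theta)

print(ump_constants(20, 0.3, 0.05))
print(power(0.3, 20, 0.3, 0.05), power(0.5, 20, 0.3, 0.05))
```

At $\theta = \theta_0$ the computed power equals $\alpha$ up to rounding, which is the defining property of the randomized test.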

Example 4.6. Consider the family $\mathcal{F}$ of Poisson distributions $P(\theta)$, $0 < \theta < \infty$. The
p.d.f.s are
\[
f(x; \theta) = e^{-\theta}\theta^x/x! = \frac{1}{x!}\exp\{x\log\theta - \theta\}, \quad x = 0, 1, \ldots.
\]
Thus, if we make the reparametrization $\omega = \log\theta$, then
\[
f(x; \omega) = \frac{1}{x!}\exp\{x\omega - e^{\omega}\}, \quad x = 0, 1, \ldots; \; -\infty < \omega < \infty.
\]
This is a one-parameter exponential type family. The hypotheses $H_0: \theta = \theta_0$ against
$H_1: \theta \neq \theta_0$ ($0 < \theta_0 < \infty$) are equivalent to the hypotheses $H_0: \omega = \omega_0$ against
$H_1: \omega \neq \omega_0$ where $\omega_0 = \log\theta_0$. The two-sided test $\phi^0(X)$ of size $\alpha$ is obtained by
(4.4.1), where the constants are determined according to the conditions (4.4.2) and
(4.4.6). Since $\mathcal{F}$ is Poisson, $E_{\theta_0}\{X\} = \theta_0$. Moreover, the p.d.f. of $P(\theta)$ satisfies the
relation
\[
j\,p(j; \theta) = \theta\,p(j - 1; \theta) \quad \text{for all } j = 1, 2, \ldots.
\]
We thus obtain the equations, for $x_1 = c_\alpha^{(1)}$ and $x_2 = c_\alpha^{(2)}$, $\gamma_1$ and $\gamma_2$:

(i) $P(x_1 - 1; \theta_0) + \gamma_1 p(x_1; \theta_0) + \gamma_2 p(x_2; \theta_0) + 1 - P(x_2; \theta_0) = \alpha$,
(ii) $P(x_1 - 2; \theta_0) + \gamma_1 p(x_1 - 1; \theta_0) + \gamma_2 p(x_2 - 1; \theta_0) + 1 - P(x_2 - 1; \theta_0) = \alpha$.

Here $P(j; \theta)$ is the Poisson c.d.f. The function is zero whenever the argument $j$ is
negative. The determination of $x_1, \gamma_1, x_2, \gamma_2$ can be done numerically. We can start
with the initial solution $x_1, \gamma_1$ and $x_2, \gamma_2$ corresponding to the "equal-tail" test. These
initial values are determined from the equations
\[
P(x_1 - 1; \theta_0) + \gamma_1 p(x_1; \theta_0) = \alpha/2,
\]
\[
\gamma_2 p(x_2; \theta_0) + 1 - P(x_2; \theta_0) = \alpha/2.
\]
This initial solution can then be modified so that both equations (i) and (ii) will be
satisfied simultaneously.
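
The "equal-tail" starting values of Example 4.6 can be obtained with a short search. The following sketch only computes the initial solution and evaluates the left-hand sides of (i) and (ii) at it; the iterative modification is not attempted, and all function names are illustrative.

```python
from math import exp, factorial

def p(j, theta):
    """Poisson p.d.f.; zero for negative arguments."""
    return exp(-theta) * theta**j / factorial(j) if j >= 0 else 0.0

def P(j, theta):
    """Poisson c.d.f.; zero for negative arguments."""
    return sum(p(i, theta) for i in range(j + 1)) if j >= 0 else 0.0

def equal_tail_start(theta0, alpha):
    """Initial (x1, gamma1, x2, gamma2) with probability alpha/2 in each tail."""
    x1 = 0
    while P(x1, theta0) < alpha / 2:
        x1 += 1
    g1 = (alpha / 2 - P(x1 - 1, theta0)) / p(x1, theta0)
    x2 = x1
    while 1 - P(x2, theta0) > alpha / 2:
        x2 += 1
    g2 = (alpha / 2 - (1 - P(x2, theta0))) / p(x2, theta0)
    return x1, g1, x2, g2

def two_sided_lhs(x1, g1, x2, g2, theta0):
    """Left-hand sides of equations (i) and (ii)."""
    lhs_i = P(x1 - 1, theta0) + g1 * p(x1, theta0) + g2 * p(x2, theta0) + 1 - P(x2, theta0)
    lhs_ii = (P(x1 - 2, theta0) + g1 * p(x1 - 1, theta0)
              + g2 * p(x2 - 1, theta0) + 1 - P(x2 - 1, theta0))
    return lhs_i, lhs_ii

start = equal_tail_start(20.0, 0.05)
print(start, two_sided_lhs(*start, 20.0))
```

The equal-tail start satisfies (i) exactly; the gap between the second value and $\alpha$ shows how far the start is from satisfying (ii).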


Example 4.7. Suppose that $X \sim N(\theta, 1)$. The null hypothesis is $H_0: \theta = 0$. The
alternative is $H_1: \theta \neq 0$. Thus, $x_1$ and $x_2$ should satisfy simultaneously the two
equations

(I) $\Phi(x_1) + 1 - \Phi(x_2) = \alpha$,

(II) $\int_{-\infty}^{x_1} x\phi(x)\,dx + \int_{x_2}^{\infty} x\phi(x)\,dx = 0$.

Notice that $x\phi(x) = -\phi'(x)$. Accordingly, equation (II) can be written as
\[
-\phi(x_1) + \phi(x_2) = 0.
\]
If we set $x_1 = z_{\alpha/2}$ and $x_2 = -x_1$, where $z_\gamma = \Phi^{-1}(\gamma)$, then, due to the symmetry
of the $N(0, 1)$ distribution around $\theta = 0$, we obtain that these $x_1$ and $x_2$ satisfy
simultaneously the two equations. The "equal-tail" solution is the desired solution in
this case.


Example 4.8. A. Testing the Significance of the Mean in Normal Samples 

The problem studied is that of testing hypotheses about the mean of a normal
distribution. More specifically, we have a sample $X_1, \ldots, X_n$ of i.i.d. random variables
from a normal distribution $N(\mu, \sigma^2)$. We test the hypothesis
\[
H_0: \mu = \mu_0, \; \sigma \text{ arbitrary}
\]
against
\[
H_1: \mu \neq \mu_0, \; \sigma \text{ arbitrary}.
\]
The m.s.s. is $(\bar{X}_n, Q_n)$, where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and $Q_n = \sum_{i=1}^n (X_i - \bar{X}_n)^2$. Consider the
t-statistic $t = \sqrt{n}(\bar{X}_n - \mu_0)/S_n$, where $S_n^2 = Q_n/(n - 1)$. The t-test of $H_0$ against $H_1$
is given by
\[
\phi(\bar{X}_n, S_n) = \begin{cases} 1, & \text{if } \sqrt{n}\,|\bar{X}_n - \mu_0|/S_n \geq t_{1-\alpha/2}[n - 1] \\ 0, & \text{otherwise.} \end{cases}
\]
$t_{1-\alpha/2}[n - 1]$ is the $(1 - \alpha/2)$-quantile of the t-distribution with $n - 1$ degrees of
freedom. It is easy to verify that this t-test has the size $\alpha$. Its power function can be
determined in the following manner. If $\mu \neq \mu_0$ then
\[
P_{\mu,\sigma}\left\{\frac{\sqrt{n}\,|\bar{X}_n - \mu_0|}{S_n} \geq t_{1-\alpha/2}[n - 1]\right\}
= P\{t[n - 1; \delta\sqrt{n}] \leq -t_{1-\alpha/2}[n - 1]\} + P\{t[n - 1; \delta\sqrt{n}] \geq t_{1-\alpha/2}[n - 1]\},
\]
where $\delta = (\mu - \mu_0)/\sigma$. According to (2.12.22), this power function can be computed
according to the formula
\[
\psi(\delta^2) = 1 - e^{-\frac{n}{2}\delta^2}\sum_{j=0}^{\infty}\frac{\left(\frac{n}{2}\delta^2\right)^j}{j!}\, I_{R(c)}\left(\frac{1}{2} + j, \frac{\nu}{2}\right),
\]
where $\nu = n - 1$, $c = t_{1-\alpha/2}[n - 1]$ and $R(c) = c^2/(\nu + c^2)$. We notice that the
power function depends on $\delta^2$ and is therefore symmetric around $\delta_0 = 0$. We prove
now that the t-test is unbiased. Rewrite the power function as a function of $\lambda = \frac{n\delta^2}{2}$
and a mixture of Poisson $P(\lambda)$ with $I_{R(c)}\left(\frac{1}{2} + J, \frac{\nu}{2}\right)$, where $J \sim P(\lambda)$ and $R(c) =
c^2/(\nu + c^2)$. The family $P(\lambda)$ is MLR in $J$. Moreover, $I_{R(c)}\left(\frac{1}{2} + j, \frac{\nu}{2}\right)$ is a decreasing
function of $j$. Hence, by Karlin's Lemma, $\psi(\lambda) = 1 - E_\lambda\left\{I_{R(c)}\left(\frac{1}{2} + J, \frac{\nu}{2}\right)\right\}$ is
an increasing function of $\lambda$. Moreover, $\psi(0) = \alpha$. This proves that the test is unbiased.


B. Testing the Significance of the Sample Correlation 

$(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. random vectors having a bivariate normal
distribution. Let $r$ be the sample coefficient of correlation (formula 2.13.1). Consider
the problem of testing the hypothesis $H_0: \rho \leq 0$, $(\mu_1, \mu_2, \sigma_1, \sigma_2)$ arbitrary; against
$H_1: \rho > 0$, $(\mu_1, \mu_2, \sigma_1, \sigma_2)$ arbitrary. Here we have four nuisance parameters. As
shown in Section 2.15, the distribution of $r$ is independent of the nuisance parameters
$(\mu_1, \mu_2, \sigma_1, \sigma_2)$ and when $\rho = 0$ (on the boundary between $\Theta_0$ and $\Theta_1$), it is
independent of all the parameters. Moreover, according to (2.13.11), the following
test is boundary $\alpha$-similar.
\[
\phi(r) = \begin{cases} 1, & \text{if } \dfrac{r}{\sqrt{1 - r^2}}\sqrt{n - 2} \geq t_{1-\alpha}[n - 2] \\ 0, & \text{otherwise.} \end{cases}
\]
The power function depends only on the parameter $\rho$ and is given by
\[
\psi(\rho) = P_\rho\left\{r \geq \left(\frac{t^2_{1-\alpha}[n - 2]}{n - 2 + t^2_{1-\alpha}[n - 2]}\right)^{1/2}\right\}.
\]
According to (2.13.12), this is equal to
\[
\psi(\rho) = \frac{2^{n-4}}{\pi (n - 3)!}(1 - \rho^2)^{\frac{n-1}{2}}\sum_{j=0}^{\infty}\Gamma\left(\frac{n + j - 1}{2}\right)\Gamma\left(\frac{j + 1}{2}\right)\Gamma\left(\frac{n}{2} - 1\right)\frac{(2\rho)^j}{j!}\, I_{R(t)}\left(\frac{n}{2} - 1, \frac{j + 1}{2}\right),
\]
where $R(t) = (n - 2)/(n - 2 + t^2_{1-\alpha}[n - 2])$. To show that this power function is a
monotone nondecreasing function of $\rho$, one can prove that the family of densities of
$r$ under $\rho$ (2.13.12) is an MLR with respect to $r$. Therefore, according to Karlin's
Lemma, $E_\rho\{\phi(r)\}$ is a nondecreasing function of $\rho$. Thus, the test function $\phi(r)$ is
not only boundary $\alpha$-similar but also unbiased.
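
Example 4.9. Let $X$ and $Y$ be independent r.v.s having Poisson distributions with
means $\lambda_1$ and $\lambda_2$, respectively. We wish to test the hypotheses $H_0: \lambda_1 = \lambda_2$ against
$H_1: \lambda_1 \neq \lambda_2$. Let $T = X + Y$. The conditional distribution of $X$ given $T$ is the
binomial $B(T, p)$ where $p = \lambda_1/(\lambda_1 + \lambda_2)$. The marginal distribution of $T$ is $P(\nu)$
where $\nu = \lambda_1 + \lambda_2$. We can therefore write the joint p.d.f. of $X$ and $T$ in the form
\[
p(x, T; \theta, \tau) = \binom{T}{x}\frac{1}{T!}\exp\{\theta X + \tau T - \nu\},
\]
where $\theta = \log(\lambda_1/\lambda_2)$ and $\tau = \log\lambda_2$. Thus, the hypotheses under consideration are
equivalent to $H_0: \theta = 0$, $\tau$ arbitrary; against $H_1: \theta \neq 0$, $\tau$ arbitrary.
Accordingly, we consider the two-sided test functions
\[
\phi^0(X \mid T) = \begin{cases} 1, & \text{if } X < \xi_1(T) \text{ or } X > \xi_2(T) \\ \gamma_i(T), & \text{if } X = \xi_i(T), \; i = 1, 2 \\ 0, & \text{otherwise.} \end{cases}
\]
This test is uniformly most powerful unbiased of size $\alpha$ if the functions $\xi_i(T)$ and
$\gamma_i(T)$, $i = 1, 2$, are determined according to the conditional distribution of $X$ given $T$,
under $H_0$. As mentioned earlier, this conditional distribution is the binomial $B(T, \frac{1}{2})$.
This is a symmetric distribution around $X_0 = T/2$. In other words, $b(i; T, \frac{1}{2}) =
b(T - i; T, \frac{1}{2})$, for all $i = 0, \ldots, T$. Conditions (4.5.9) are equivalent to

(i) $\sum_{i=0}^{\xi_1 - 1} b\left(i; T, \frac{1}{2}\right) + \gamma_1 b\left(\xi_1; T, \frac{1}{2}\right) + \gamma_2 b\left(\xi_2; T, \frac{1}{2}\right) + \sum_{i=\xi_2 + 1}^{T} b\left(i; T, \frac{1}{2}\right) = \alpha$,

(ii) $\sum_{i=0}^{\xi_1 - 1} i\, b\left(i; T, \frac{1}{2}\right) + \gamma_1 \xi_1 b\left(\xi_1; T, \frac{1}{2}\right) + \gamma_2 \xi_2 b\left(\xi_2; T, \frac{1}{2}\right) + \sum_{i=\xi_2 + 1}^{T} i\, b\left(i; T, \frac{1}{2}\right) = \alpha \cdot \frac{T}{2}$.

It is easy to verify that, due to the symmetry of the Binomial $B(T, \frac{1}{2})$, the functions
that satisfy (i) and (ii) are
\[
\xi_1(T) = B^{-1}\left(\frac{\alpha}{2}; T, \frac{1}{2}\right), \qquad \xi_2(T) = T - \xi_1(T),
\]
\[
\gamma_1(T) = \frac{\frac{\alpha}{2} - B\left(\xi_1(T) - 1; T, \frac{1}{2}\right)}{b\left(\xi_1(T); T, \frac{1}{2}\right)}, \qquad \text{and} \quad \gamma_2(T) = \gamma_1(T).
\]
Here $B^{-1}(\frac{\alpha}{2}; T, \frac{1}{2})$ is the $\frac{\alpha}{2}$-quantile of $B(T, \frac{1}{2})$ and $B(j; T, \frac{1}{2})$ is the c.d.f. of $B(T, \frac{1}{2})$
at $X = j$.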
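
The conditional test of Example 4.9 is easy to tabulate for a given value of $T$. The sketch below uses illustrative names and assumes that $T$ is large enough for $\xi_1(T) < \xi_2(T)$; it also verifies conditions (i) and (ii) numerically.

```python
from math import comb

def b(i, t):
    """p.d.f. of B(t, 1/2) at i."""
    return comb(t, i) / 2**t

def B(j, t):
    """c.d.f. of B(t, 1/2) at j (zero when j < 0)."""
    return sum(b(i, t) for i in range(j + 1))

def conditional_constants(t, alpha):
    """xi_1(T), xi_2(T) = T - xi_1(T) and the common gamma_1(T) = gamma_2(T)."""
    xi1 = next(k for k in range(t + 1) if B(k, t) >= alpha / 2)
    xi2 = t - xi1
    gamma = (alpha / 2 - B(xi1 - 1, t)) / b(xi1, t)
    return xi1, xi2, gamma

def phi(x, t, alpha):
    """Conditional test function phi^0(X | T)."""
    xi1, xi2, gamma = conditional_constants(t, alpha)
    if x < xi1 or x > xi2:
        return 1.0
    if x in (xi1, xi2):
        return gamma
    return 0.0

t, alpha = 20, 0.05
size = sum(phi(i, t, alpha) * b(i, t) for i in range(t + 1))
first_moment = sum(i * phi(i, t, alpha) * b(i, t) for i in range(t + 1))
print(conditional_constants(t, alpha), size, first_moment)
```

The two printed sums correspond to the left-hand sides of (i) and (ii), which should equal $\alpha$ and $\alpha T/2$.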

Example 4.10. In a clinical trial we test the effect of a certain treatment, in comparison
to some standard treatment, at two different stations. The null hypothesis is that
the effect of the two treatments relative to the control is the same at the two stations.
For this objective, a balanced experiment is conducted in which $2n$ patients are tested
at each station, $n$ patients with the new treatment and $n$ with the standard one. The
observed random variables, $X_{ij}$ ($i = 1, 2$; $j = 1, 2$) are the number of successes in
each sample of $n$. There are four independent binomial random variables. Let $\theta_{ij}$
($i, j = 1, 2$) denote the probability of success. $i = 1, 2$ denotes the station index and
$j = 1, 2$ denotes the treatment index ($j = 1$ for the standard treatment and $j = 2$ for
the new treatment). Thus $X_{ij} \sim B(n, \theta_{ij})$. Let $T_i = X_{i1} + X_{i2}$ ($i = 1, 2$) and
\[
\rho_i = \frac{\theta_{i1}(1 - \theta_{i2})}{(1 - \theta_{i1})\theta_{i2}}, \quad i = 1, 2.
\]
Let $Y_i = X_{i1}$ ($i = 1, 2$). The conditional p.d.f. of $Y_i$ given $T_i$ is the confluent
hypergeometric function
\[
p(y \mid T_i = t) = \frac{\binom{n}{y}\binom{n}{t - y}\rho_i^y}{\sum_{k=0}^{t}\binom{n}{k}\binom{n}{t - k}\rho_i^k}, \quad y = 0, \ldots, t,
\]
where generally $\binom{a}{b} = 0$ if $b > a$. We notice that when $\rho_i = 1$ (i.e., $\theta_{i1} = \theta_{i2}$), then
the p.d.f. is the hypergeometric p.d.f. $h(y \mid 2n, n, t)$ as given by (2.3.6). Thus, since
$Y_1$ and $Y_2$ are independent, the joint conditional p.d.f. of $(Y_1, Y_2)$ given $T_1 = t$ and
$T_2 = v$ under $(\rho_1, \rho_2)$ is
\[
p(y_1, y_2 \mid T_1 = t, T_2 = v) = \frac{\binom{n}{y_1}\binom{n}{t - y_1}\binom{n}{y_2}\binom{n}{v - y_2}\rho_1^{y_1}\rho_2^{y_2}}{\sum_{k_1=0}^{t}\sum_{k_2=0}^{v}\binom{n}{k_1}\binom{n}{t - k_1}\binom{n}{k_2}\binom{n}{v - k_2}\rho_1^{k_1}\rho_2^{k_2}}.
\]
We consider the problem of testing the hypotheses:
\[
H_0: \rho_1 = \rho_2 \quad \text{against} \quad H_1: \rho_1 \neq \rho_2.
\]
Our hypothesis $H_0$ means that there is no interaction between the effect of the
treatment and that of the station. We notice now that under $H_0$, $S = Y_1 + Y_2$ is a
sufficient statistic for the family of joint conditional distributions given $T_1$ and $T_2$.
Furthermore, the conditional p.d.f. of $Y_1$ given $T_1$, $T_2$, and $S$ is
\[
p(y \mid T_1 = t, T_2 = v, S = k) = \frac{\binom{n}{y}\binom{n}{t - y}\binom{n}{k - y}\binom{n}{v - k + y}\omega^y}{\sum_{j=0}^{k}\binom{n}{j}\binom{n}{t - j}\binom{n}{k - j}\binom{n}{v - k + j}\omega^j}, \quad y = 0, \ldots, k,
\]
where $\omega = \rho_1/\rho_2$. The family of all the conditional distributions of $Y_1$ given
$(T_1, T_2, S)$ is an MLR family w.r.t. $Y_1$. The hypotheses $H_0: \rho_1 = \rho_2$ against $H_1:
\rho_1 \neq \rho_2$ are equivalent to the hypotheses $H_0: \omega = 1$ against $H_1: \omega \neq 1$. Accordingly,
the conditional test function
\[
\phi^0(Y_1 \mid T_1, T_2, S) = \begin{cases} 1, & \text{if } Y_1 < \xi_1(T_1, T_2, S) \text{ or } Y_1 > \xi_2(T_1, T_2, S) \\ \gamma_i, & \text{if } Y_1 = \xi_i(T_1, T_2, S), \; i = 1, 2 \\ 0, & \text{otherwise,} \end{cases}
\]
is uniformly most powerful unbiased of size $\alpha$, if the functions $\xi_i(T_1, T_2, S)$ and
$\gamma_i(T_1, T_2, S)$ are determined to satisfy conditions (i) and (ii) of (4.5.9) simultaneously.
To prove it, we have to show that the family of conditional joint distributions of $S$ given
$(T_1, T_2)$ is complete and that the power function of every test function is continuous
in $(\theta_{11}, \theta_{12}, \theta_{21}, \theta_{22})$. This is left to the reader as an exercise. For the computation of
the power function and further investigation, see Zacks and Solomon (1976).


Example 4.11. A. In this example we show that the t-test, which was derived in
Example 4.9, is uniformly most powerful unbiased. An m.s.s. for the family of normal
distributions $\mathcal{F} = \{N(\mu, \sigma^2); -\infty < \mu < \infty, 0 < \sigma < \infty\}$ is $(\sum X_i, \sum X_i^2)$. Let $U =
\frac{1}{n}\sum X_i$ and $T = \sum X_i^2$. We notice that $T$ is an m.s.s. for $\mathcal{F}^*$ (the family restricted to
the boundary, $\mu = 0$). Consider the statistic $W = \sqrt{n}\,U/\left(\frac{1}{n - 1}(T - nU^2)\right)^{1/2}$.
We notice that if $\mu = 0$, then $W \sim t[n - 1]$ independently of $\sigma^2$. On the other
hand, $T \sim \sigma^2\chi^2[n]$ when $\mu = 0$. Therefore, according to Basu's Theorem, $W$ and
$T$ are independent for each $\theta \in \Theta^*$ (the boundary) since the family $\mathcal{F}^*$ is complete.
Furthermore, $W$ is an increasing function of $U$ for each $T$. Hence, the t-test is
uniformly most powerful unbiased.

B. Consider part (B) of Example 4.9. The m.s.s. is $(\sum X_i, \sum X_i^2, \sum Y_i,
\sum Y_i^2, \sum X_i Y_i)$. If we denote by $\mathcal{F}^*$ the family of all bivariate normal distributions
with $\rho = 0$ (corresponding to the boundary), then $T = (\sum X_i, \sum X_i^2, \sum Y_i, \sum Y_i^2)$ is an
m.s.s. for $\mathcal{F}^*$. Let $U = \sum X_i Y_i$. The sample correlation coefficient $r$ is given by
\[
r = W(U, T) = \left[nU - \left(\sum X_i\right)\left(\sum Y_i\right)\right] \Big/ \left\{\left[n\sum X_i^2 - \left(\sum X_i\right)^2\right]^{1/2} \cdot \left[n\sum Y_i^2 - \left(\sum Y_i\right)^2\right]^{1/2}\right\}.
\]
This function is increasing in $U$ for each $T$. We notice that the distribution of $r$
is independent of $\nu = (\mu_1, \mu_2, \sigma_1, \sigma_2)$. Therefore, $r$ is independent of $T$ for each $\nu$
whenever $\rho = 0$. The test function $\phi(r)$ of Example 4.9 is uniformly most powerful
unbiased to test $H_0: \rho \leq 0$, $\nu$ arbitrary, against $H_1: \rho > 0$, $\nu$ arbitrary.
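
As a small illustration of part B, the sample correlation can be evaluated directly from the five sufficient statistics. The function name and the toy data are assumptions made here for the sketch.

```python
from math import sqrt

def r_from_sufficient(n, sx, sxx, sy, syy, u):
    """r = W(U, T) with T = (sum X, sum X^2, sum Y, sum Y^2) and U = sum XY."""
    return (n * u - sx * sy) / (sqrt(n * sxx - sx**2) * sqrt(n * syy - sy**2))

# Toy data lying exactly on a line, so r should equal 1 up to rounding.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
r = r_from_sufficient(
    len(xs), sum(xs), sum(x * x for x in xs),
    sum(ys), sum(y * y for y in ys),
    sum(x * y for x, y in zip(xs, ys)),
)
print(r)
```

With $T$ held fixed the expression is linear in $U$ with a positive coefficient, which is the monotonicity used in the argument.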
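
Example 4.12. Consider again the components of variance Model II of Analysis of
Variance, which is discussed in Example 3.9. Here, we have a three-parameter family
of normal distributions with parameters $\mu$, $\sigma^2$, and $\tau^2$. We set $\rho = \tau^2/\sigma^2$.

A. For testing the hypotheses
\[
H_0: \mu \leq 0, \; \nu = (\sigma^2, \rho) \text{ arbitrary,}
\]
against
\[
H_1: \mu > 0, \; \nu = (\sigma^2, \rho) \text{ arbitrary,}
\]
the t-test
\[
\phi(W) = \begin{cases} 1, & \text{if } \sqrt{nr}\,\bar{X} \Big/ \left[\dfrac{n}{r - 1}\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2\right]^{1/2} \geq t_{1-\alpha}[r - 1] \\ 0, & \text{otherwise} \end{cases}
\]
is a uniformly most powerful unbiased one. Indeed, if we set $U = T_3(\mathbf{X}) = \bar{X}$, $T =
(T_1(\mathbf{X}), T_2(\mathbf{X}))$, then $W(U, T) = \sqrt{nr}\,U \Big/ \left[\frac{n}{r - 1}\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2\right]^{1/2}$ is distributed,
when $\mu = 0$, as $t[r - 1]$ for all $(\sigma^2, \rho)$. The exponential family is complete. Hence,
$W(U, T)$ and $T$ are independent for each $(\sigma^2, \rho)$ when $\mu = 0$. Furthermore, $W(U, T)$
is an increasing function of $U$ for each $T$.

B. For testing the hypotheses
\[
H_0: \rho \leq 1, \; (\sigma^2, \mu) \text{ arbitrary}
\]
against
\[
H_1: \rho > 1, \; (\sigma^2, \mu) \text{ arbitrary}
\]
the test function
\[
\phi(W) = \begin{cases} 1, & \text{if } W \geq F_{1-\alpha}[r - 1, r(n - 1)] \\ 0, & \text{otherwise} \end{cases}
\]
is uniformly most powerful unbiased. Here
\[
W = nr(n - 1)\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2 \Big/ (r - 1)(1 + n)\sum_{i=1}^{r}\sum_{j=1}^{n}(X_{ij} - \bar{X}_i)^2,
\]
and $F_{1-\alpha}[r - 1, r(n - 1)]$ is the $(1 - \alpha)$-quantile of the central F-distribution with
$(r - 1)$ and $r(n - 1)$ degrees of freedom.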


Example 4.13. Let $X \sim N(\theta, 1)$. We consider the two simple hypotheses $H_0: \theta = 0$
versus $H_1: \theta = 1$. The statistic $\Lambda(X)$ is
\[
\Lambda(X) = \frac{f(X; 0)}{\max\{f(X; 0), f(X; 1)\}} = 1/\max(1, e^{X - 1/2}).
\]
Obviously, $\Lambda(X) = 1$ if and only if $X \leq \frac{1}{2}$. It follows that, under $\theta = 0$, $P_0[\Lambda(X) =
1] = \Phi(\frac{1}{2}) = .691$. Therefore, in this example, the generalized likelihood ratio test
can be performed only for $\alpha \leq 1 - .691 = .309$ or for $\alpha = 1$. This is a restriction on
the generalized likelihood ratio test. However, generally we are interested in small
values of $\alpha$, for which the test exists.
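
The value $\Phi(\frac{1}{2})$ and the resulting bound on attainable sizes can be reproduced with the error function. This is a brief sketch; the helper names are illustrative.

```python
from math import erf, exp, sqrt

def std_normal_cdf(x):
    """Phi(x) expressed through the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def glr(x):
    """Lambda(X) = 1 / max(1, exp(X - 1/2))."""
    return 1 / max(1.0, exp(x - 0.5))

p_one = std_normal_cdf(0.5)   # P_0{Lambda(X) = 1}
max_alpha = 1 - p_one         # largest size below 1 for which the test exists
print(round(p_one, 3), round(max_alpha, 3))
```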

Example 4.14. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common Laplace
distribution with p.d.f.
\[
f(x; \theta) = \frac{1}{2}\exp\{-|x - \theta|\}, \quad -\infty < x < \infty,
\]
where the parameter space is $\Theta = \{-\infty < \theta < \infty\}$. We wish to test
\[
H_0: \theta = 0
\]
against
\[
H_1: \theta \neq 0.
\]
The sample statistic $\hat{\theta}$, which minimizes $\sum_{i=1}^{n}|x_i - \theta|$, is the sample median
\[
M_e = \begin{cases} X_{(m+1)}, & \text{if } n = 2m + 1 \\ \frac{1}{2}(X_{(m)} + X_{(m+1)}), & \text{if } n = 2m. \end{cases}
\]
Thus, the generalized likelihood statistic is
\[
\Lambda(\mathbf{X}_n) = \exp\left\{-\left(\sum_{i=1}^{n}|x_i| - \sum_{i=1}^{n}|x_i - M_e|\right)\right\}.
\]
$\Lambda(\mathbf{X}_n)$ is sufficiently small if
\[
T(\mathbf{X}_n) = \sum_{i=1}^{n}|X_i| - \sum_{i=1}^{n}|X_i - M_e|
\]
is sufficiently large. To obtain a size $\alpha$ test, we need the $(1 - \alpha)$-quantile of the
sampling distribution of $T(\mathbf{X}_n)$ under $\theta = 0$. $M = 1000$ simulation runs, using
S-PLUS, gave the following estimates of the .95th quantile of $T(\mathbf{X}_n)$.

    n            20       50       100
    T_{.95,n}    1.9815   2.0013   1.9502

Notice that $-2\log\Lambda(\mathbf{X}_n) = 2 \cdot T(\mathbf{X}_n)$ and that $\chi^2_{.95}[1] = 3.8415$. Also $2T_{.95,n} \cong
\chi^2_{.95}[1]$.

Example 4.15. Fleiss (1973, p. 131) gave the following 2 x 2 table of G-6-PD
deficiency (A) and type of schizophrenia (B) among $N = 177$ patients.

    A \ B            Catatonic   Paranoid   Total
    Deficient            15          6         21
    Non-deficient        57         99        156
    Total                72        105        177

We test whether the association between the two variables is significant. The
$X^2$ statistic for this table is equal to 9.34. This is greater than $\chi^2_{.95}[1] = 3.84$ and
therefore significant at the $\alpha = .05$ level. To perform the conditional test we compute
the hypergeometric distribution $H(N, T, S)$ with $N = 177$, $T = 21$, and $S = 72$. In
Table 4.3, we present the p.d.f. $h(x; N, T, S)$ and the c.d.f. $H(x; N, T, S)$ of this
distribution.

According to this conditional distribution, with $\alpha = .05$, we reject $H_0$ whenever
$X \leq 4$ or $X \geq 14$. If $X = 5$ we reject $H_0$ only with probability $\gamma_1 = .006$. If $X = 13$
we reject $H_0$ with probability $\gamma_2 = .699$. In this example, $X = 15$ and therefore we
conclude that the association is significant.
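
The numbers of Example 4.15 can be recomputed from the table entries. The sketch below uses the shortcut formula for the $X^2$ statistic of a 2 x 2 table and equal tails of $\alpha/2$ for the randomization probabilities; the variable names are illustrative.

```python
from math import comb

N, T, S = 177, 21, 72          # patients, deficient, catatonic
a, b_, c, d = 15, 6, 57, 99    # cell counts of the 2 x 2 table

def h(x):
    """Hypergeometric p.d.f. h(x; N, T, S)."""
    return comb(S, x) * comb(N - S, T - x) / comb(N, T)

def H(x):
    """Hypergeometric c.d.f. H(x; N, T, S)."""
    return sum(h(j) for j in range(x + 1))

# Pearson chi-square statistic for a 2 x 2 table.
chi2 = N * (a * d - b_ * c) ** 2 / ((a + b_) * (c + d) * (a + c) * (b_ + d))

alpha = 0.05
gamma1 = (alpha / 2 - H(4)) / h(5)            # randomization at X = 5
gamma2 = (alpha / 2 - (1 - H(13))) / h(13)    # randomization at X = 13
print(round(chi2, 2), round(gamma1, 3), round(gamma2, 3))
```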

Table 4.3 The Hypergeometric Distribution H(177, 21, 72)

     x    h(x; N, T, S)    H(x; N, T, S)
     0      0.000007         0.000007
     1      0.000124         0.000131
     2      0.001022         0.001153
     3      0.005208         0.006361
     4      0.018376         0.024736
     5      0.047735         0.072471
     6      0.094763         0.167234
     7      0.147277         0.314511
     8      0.182095         0.496607
     9      0.181006         0.677614
    10      0.145576         0.823190
    11      0.095008         0.918198
    12      0.050308         0.968506
    13      0.021543         0.990049

Example 4.16. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables having
a common normal distribution $N(\theta, 1)$, $-\infty < \theta < \infty$. Suppose that for testing
the hypothesis $H_0: \theta \leq 0$ against $H_1: \theta \geq 1$, we construct the Wald SPRT of the
two simple hypotheses $H_0^*: \theta = 0$ against $H_1^*: \theta = 1$ with boundaries $A'$ and $B'$
corresponding to $\alpha = .05$ and $\beta = .05$.

Notice that
\[
Z = \log\frac{f_1(X)}{f_0(X)} = -\frac{1}{2}\left[(X - 1)^2 - X^2\right] = X - 1/2.
\]
Accordingly,
\[
E_\theta\{Z\} = \theta - \frac{1}{2}.
\]
The m.g.f. of $Z$ at $\theta$ is
\[
M_\theta(t) = E_\theta\{e^{t(X - \frac{1}{2})}\} = \exp\left\{\frac{t^2}{2} + \left(\theta - \frac{1}{2}\right)t\right\}.
\]
Thus, $t_0(\theta) = 1 - 2\theta$, and from (4.8.26)-(4.8.27), the acceptance probabilities are
\[
\pi(\theta) \approx \begin{cases} \dfrac{19^{1-2\theta} - 1}{19^{1-2\theta} - 19^{2\theta-1}}, & \theta \neq .5 \\ .5, & \theta = .5. \end{cases}
\]
In the following table we present some of the $\pi(\theta)$ and $E_\theta\{N\}$ values, determined
according to the approximations (4.8.15) and (4.8.26).

    theta        -1      -.5      0      .25     .50     .75      1      1.5     2.0
    pi(theta)  .99985  .99724  .95000  .81339  .50000  .18661  .05000  .00276  .00015
    E{N}         2.0     2.9     5.3     7.4     8.7     7.4     5.3     2.9     2.0

The number of observations required in a fixed sample design for testing $H_0^*:
\theta = 0$ against $H_1^*: \theta = 1$ with $\alpha = \beta = .05$ is $n = 16$. According to the above table,
the expected sample size in a SPRT when $\theta = 0$ or $\theta = 1$ is only one third of that
required in a fixed sample testing.
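
The table of Example 4.16 can be regenerated numerically. The sketch below assumes that (4.8.15) is the usual Wald approximation $E_\theta\{N\} \approx [\pi(\theta)\log\frac{\beta}{1-\alpha} + (1 - \pi(\theta))\log\frac{1-\beta}{\alpha}]/E_\theta\{Z\}$, and it uses (4.8.27) and (4.8.29) at $\theta = .5$, where $E_\theta\{Z\} = 0$ and $E_\theta\{Z^2\} = 1$. Function and constant names are illustrative.

```python
from math import log

ALPHA = BETA = 0.05
A = log(BETA / (1 - ALPHA))   # log of the lower boundary, -log 19
B = log((1 - BETA) / ALPHA)   # log of the upper boundary, log 19

def acceptance_probability(theta):
    """Approximate pi(theta) for the N(theta, 1) SPRT of theta = 0 against theta = 1."""
    t0 = 1 - 2 * theta
    if abs(t0) < 1e-12:
        return B / (B - A)                    # (4.8.27)
    upper = ((1 - BETA) / ALPHA) ** t0
    lower = (BETA / (1 - ALPHA)) ** t0
    return (upper - 1) / (upper - lower)      # (4.8.26)

def expected_sample_size(theta):
    """Approximate E_theta{N}; Z = X - 1/2 has mean theta - 1/2 and variance 1."""
    pi = acceptance_probability(theta)
    drift = theta - 0.5
    if abs(drift) < 1e-12:
        return pi * A**2 + (1 - pi) * B**2    # (4.8.29) with E{Z^2} = 1
    return (pi * A + (1 - pi) * B) / drift    # assumed form of (4.8.15)

for theta in (-1, -0.5, 0, 0.25, 0.5, 0.75, 1, 1.5, 2):
    print(theta, round(acceptance_probability(theta), 5),
          round(expected_sample_size(theta), 1))
```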

Section 4.1


4.1.1 Consider Example 4.1. It was suggested to apply the test statistic $\phi(X) =
I\{X \leq 18\}$. What is the power of the test if (i) $\theta = .6$; (ii) $\theta = .5$; (iii)
$\theta = .4$? [Hint: Compute the power exactly by applying the proper binomial
distributions.]


4.1.2 Consider the testing problem of Example 4.1 but assume that the number of
trials is $n = 100$.
(i) Apply the normal approximation to the binomial to develop a large
sample test of the hypothesis $H_0: \theta \geq .75$ against the alternative $H_1:
\theta < .75$.
(ii) Apply the normal approximation to determine the power of the large
sample test when $\theta = .5$.
(iii) Determine the sample size $n$ according to the large sample formulae so
that the power of the test, when $\theta = .6$, will not be smaller than 0.9,
while the size of the test will not exceed $\alpha = .05$.


4.1.3 Suppose that $X$ has a Poisson distribution with mean $\lambda$. Consider the hypotheses
$H_0: \lambda = 20$ against $H_1: \lambda \neq 20$.
(i) Apply the normal approximation to the Poisson to develop a test of $H_0$
against $H_1$ at level of significance $\alpha = .05$.
(ii) Approximate the power function and determine its value when $\lambda = 25$.

Section 4.2

4.2.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common negative-binomial
distribution $NB(\psi, \nu)$, where $\nu$ is known.
(i) Apply the Neyman-Pearson Lemma to derive the MP test of size $\alpha$ of
$H_0: \psi \leq \psi_0$ against $H_1: \psi \geq \psi_1$, where $0 < \psi_0 < \psi_1 < 1$.
(ii) What is the power function of the test?
(iii) What should be the sample size $n$ so that, when $\psi_0 = .05$ and $\alpha = .10$,
the power at $\psi = .15$ will be $1 - \beta = .80$?


4.2.2 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a common distribution
$F_\theta$ belonging to a regular family. Consider the two simple
hypotheses $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$; $\theta_0 \neq \theta_1$. Let $\sigma_i^2 =
\mathrm{var}_{\theta_i}\{\log(f(X_1; \theta_1)/f(X_1; \theta_0))\}$, $i = 0, 1$, and assume that $0 < \sigma_i^2 < \infty$,
$i = 0, 1$. Apply the Central Limit Theorem to approximate the MP test and its
power in terms of the Kullback-Leibler information functions $I(\theta_0, \theta_1)$ and
$I(\theta_1, \theta_0)$, when the sample size $n$ is sufficiently large.


4.2.3 Consider the one-parameter exponential type family with p.d.f. $f(x; \psi) =
h(x)\exp\{\psi x - K(\psi)\}$, where $K(\psi)$ is strictly convex having second derivatives
at all $\psi \in \Omega$, i.e., $K''(\psi) > 0$ for all $\psi \in \Omega$, where $\Omega$ is an open interval
on the real line. For applying the asymptotic results of the previous problem
to test $H_0: \psi \leq \psi_0$ against $H_1: \psi \geq \psi_1$, where $\psi_0 < \psi_1$, show
(i) $E_{\psi_i}\{\log(f(X; \psi_1)/f(X; \psi_0))\} = K'(\psi_i) \cdot (\psi_1 - \psi_0) - (K(\psi_1) -
K(\psi_0))$; $i = 0, 1$.
(ii) $\mathrm{Var}_{\psi_i}\{\log(f(X; \psi_1)/f(X; \psi_0))\} = (\psi_1 - \psi_0)^2 K''(\psi_i)$; $i = 0, 1$.
(iii) If $Z_j = \log(f(X_j; \psi_1)/f(X_j; \psi_0))$; $j = 1, 2, \ldots$ where $X_1, X_2, \ldots, X_n$
are i.i.d. and $\bar{Z}_n = \frac{1}{n}\sum_{j=1}^{n} Z_j$, then the MP test of size $\alpha$ for $H_0^*: \psi = \psi_0$
against $H_1^*: \psi = \psi_1$ is asymptotically of the form $\phi(\bar{Z}_n) = I\{\bar{Z}_n \geq
l_\alpha\}$, where $l_\alpha = K'(\psi_0)(\psi_1 - \psi_0) - (K(\psi_1) - K(\psi_0)) +
\frac{z_{1-\alpha}}{\sqrt{n}}(\psi_1 - \psi_0)(K''(\psi_0))^{1/2}$ and $z_{1-\alpha} = \Phi^{-1}(1 - \alpha)$.
(iv) The power of the asymptotic test at $\psi_1$ is approximately
\[
\Phi\left(\frac{\sqrt{n}}{(K''(\psi_1))^{1/2}}\left(K'(\psi_1) - K'(\psi_0)\right) - z_{1-\alpha}\left(\frac{K''(\psi_0)}{K''(\psi_1)}\right)^{1/2}\right).
\]
(v) Show that the power function given in (iv) is monotonically increasing
in $\psi_1$.


4.2.4 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a common negative-binomial
distribution $NB(p, \nu)$, $\nu$ fixed. Apply the results of the previous
problem to derive a large sample test of size $\alpha$ of $H_0: p \leq p_0$ against $H_1:
p \geq p_1$, $0 < p_0 < p_1 < 1$.


4.2.5 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a common distribution
with p.d.f. $f(x; \mu, \theta) = (1 - \theta)\phi(x) + \theta\phi(x - \mu)$, $-\infty < x < \infty$, where $\mu$
is known, $\mu > 0$; $0 \leq \theta \leq 1$; and $\phi(x)$ is the standard normal p.d.f.
(i) Construct the MP test of size $\alpha$ of $H_0: \theta = 0$ against $H_1: \theta = \theta_1$,
$0 < \theta_1 < 1$.
(ii) What is the critical level and the power of the test?


4.2.6 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common continuous distribution
with p.d.f. $f(x; \theta)$. Consider the problem of testing the two simple
hypotheses $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$, $\theta_0 \neq \theta_1$. The MP test is of the
form $\phi_c(\mathbf{x}) = I\{S_n \geq c\}$, where $S_n = \sum_{i=1}^{n}\log(f(X_i; \theta_1)/f(X_i; \theta_0))$. The two
types of error associated with $\phi_c$ are
\[
\epsilon_0(c) = P_0\{S_n \geq c\} \quad \text{and} \quad \epsilon_1(c) = P_1\{S_n < c\}.
\]
A test $\phi_{c^*}$ is called minimax if it minimizes $\max(\epsilon_0(c), \epsilon_1(c))$. Show that $\phi_{c^*}$
is minimax if there exists a $c^*$ such that $\epsilon_0(c^*) = \epsilon_1(c^*)$.


Section 4.3 
4.3.1 Consider the one-parameter exponential type family with p.d.f.s 
f (x30) = h(x) exp{Q(@)U(x)+ C(/)}, GEO, 
where Q’(6) > 0 for all 0 € ©; O(@) and C(@) have second order derivatives 


atall9 € O. 
(i) Show that the family F is MLR in U(X). 
(ii) Suppose that X1,..., X,, are i.i.d. random variables having such a dis- 


tribution. What is the distribution of the m.s.s. T(X) = You(x i)? 
j=l 


(iii) Construct the UMP test of size a of Hp : 0 < 9 against H, : 0 > Op. 


(iv) Show that the power function is differentiable and monotone increasing 
in @. 


4.3.2 Let X,,..., X, bei.id. random variables having a scale and location param- 
eter exponential distribution with p.d_f. 


1 1 
f(x5 HM, 0) = = exp jae — wh 1 > pw}; 
or o 
0O<0< we; -wWM<UL<w. 


(i) Develop the a-level UMP test of Ho : uw < Mo, against 4 > Wo When o 
is known. 

(ii) Consider the hypotheses Hp : uw = fo, o = Oo against Hy : uw < Lo, 
o <0. Show that there exists a UMP test of size a and provide its 
power function. 
4.3.3 Consider $n$ identical systems that operate independently. It is assumed that the
time till failure of a system has a $G(\frac{1}{\theta}, 1)$ distribution. Let $Y_1, Y_2, \ldots, Y_r$
be the failure times until the $r$th failure.

(i) Show that the total life $T_{n,r} = \sum_{i=1}^{r} Y_i + (n - r)Y_r$ is distributed like
$\frac{\theta}{2}\chi^2[2r]$.
(ii) Construct the $\alpha$-level UMP test of $H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$ based
on $T_{n,r}$.
(iii) What is the power function of the UMP test?


4.3.4 Consider the linear regression model prescribed in Problem 3, Section 2.9.
Assume that $\alpha$ and $\sigma$ are known.

(i) What is the least-squares estimator of $\beta$?
(ii) Show that there exists a UMP test of size $\alpha$ for $H_0: \beta \leq \beta_0$ against
$\beta > \beta_0$.
(iii) Write the power function of the UMP test.


Section 4.4 


4.4.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having an $N(0, \sigma^2)$ distribution.
Determine the UMP unbiased test of size $\alpha$ of $H_0: \sigma^2 = \sigma_0^2$ against $H_1:
\sigma^2 \neq \sigma_0^2$, where $0 < \sigma_0^2 < \infty$.


4.4.2 Let $X \sim B(20, \theta)$, $0 < \theta < 1$. Construct the UMP unbiased test of size $\alpha =
.05$ of $H_0: \theta = .15$ against $H_1: \theta \neq .15$. What is the power of the test when
$\theta = .05, .15, .20, .25$?


4.4.3 Let $X_1, \ldots, X_n$ be i.i.d. having a common exponential distribution $G(\frac{1}{\theta}, 1)$,
$0 < \theta < \infty$. Consider the reliability function $\rho = \exp\{-t/\theta\}$, where $t$ is
known. Construct the UMP unbiased test of size $\alpha$ for $H_0: \rho = \rho_0$ against
$H_1: \rho \neq \rho_0$, for some $0 < \rho_0 < 1$.


Section 4.5 


4.5.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables where $X_i \sim \xi + G(\frac{1}{\sigma}, 1)$,
$-\infty < \xi < \infty$, $0 < \sigma < \infty$. Construct the UMPU tests of size $\alpha$ and their power
function for the hypotheses:

(i) $H_0: \xi \leq \xi_0$, $\sigma$ arbitrary; $H_1: \xi > \xi_0$, $\sigma$ arbitrary.
(ii) $H_0: \sigma = \sigma_0$, $\xi$ arbitrary; $H_1: \sigma \neq \sigma_0$, $\xi$ arbitrary.


4.5.2 Let $X_1, \ldots, X_m$ be i.i.d. random variables distributed like $N(\mu_1, \sigma^2)$ and let
$Y_1, \ldots, Y_n$ be i.i.d. random variables distributed like $N(\mu_2, \sigma^2)$; $-\infty < \mu_1$,
$\mu_2 < \infty$; $0 < \sigma^2 < \infty$. Furthermore, the X-sample is independent of the
Y-sample. Construct the UMPU test of size $\alpha$ for

(i) $H_0: \mu_1 = \mu_2$, $\sigma$ arbitrary; $H_1: \mu_1 \neq \mu_2$, $\sigma$ arbitrary.
(ii) What is the power function of the test?


4.5.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables having $N(\mu, \sigma^2)$ distribution.
Construct a test of size $\alpha$ for $H_0: \mu + 2\sigma \geq 0$ against $\mu + 2\sigma < 0$. What is the
power function of the test?


4.5.4 In continuation of Problem 3, construct a test of size $\alpha$ for

$H_0: \mu_1 + 5\mu_2 \leq 10$, $\sigma$ arbitrary;
$H_1: \mu_1 + 5\mu_2 > 10$, $\sigma$ arbitrary.


4.5.5 Let $(X_1, X_2)$ have a trinomial distribution with parameters $(n, \theta_1, \theta_2)$, where
$0 < \theta_1, \theta_2 < 1$, and $\theta_1 + \theta_2 \leq 1$. Construct the UMPU test of size $\alpha$ of the
hypotheses $H_0: \theta_1 = \theta_2$; $H_1: \theta_1 \neq \theta_2$.


4.5.6 Let $X_1, X_2, X_3$ be independent Poisson random variables with means $\lambda_1, \lambda_2,
\lambda_3$, respectively, $0 < \lambda_i < \infty$ $(i = 1, 2, 3)$. Construct the UMPU test of size
$\alpha$ of $H_0: \lambda_1 = \lambda_2 = \lambda_3$; $H_1: \lambda_1 > \lambda_2 > \lambda_3$.


Section 4.6 


4.6.1 Consider the normal regression model of Problem 3, Section 2.9. Develop the
likelihood ratio test, of size $\epsilon$, of
(i) $H_0: \alpha = 0$, $\beta$, $\sigma$ arbitrary; $H_1: \alpha \neq 0$; $\beta$, $\sigma$ arbitrary.
(ii) $H_0: \beta = 0$, $\alpha$, $\sigma$ arbitrary; $H_1: \beta \neq 0$; $\alpha$, $\sigma$ arbitrary.
(iii) $H_0: \sigma \geq \sigma_0$, $\alpha$, $\beta$ arbitrary; $H_1: \sigma < \sigma_0$; $\alpha$, $\beta$ arbitrary.


4.6.2 Let $(\bar{X}_1, S_1^2), \ldots, (\bar{X}_k, S_k^2)$ be the sample mean and variance of $k$ independent
random samples of size $n_1, \ldots, n_k$, respectively, from normal distributions
$N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, k$. Develop the likelihood ratio test for testing $H_0:
\sigma_1 = \cdots = \sigma_k$; $\mu_1, \ldots, \mu_k$ arbitrary against the general alternative $H_1:
\sigma_1, \ldots, \sigma_k$ and $\mu_1, \ldots, \mu_k$ arbitrary. [The test that rejects $H_0$ when

$$\sum_{i=1}^{k} n_i \log \frac{S_p^2}{S_i^2} \geq \chi^2_{1-\alpha}[k - 1],$$

where $S_p^2 = \frac{1}{N - k}\sum_{i=1}^{k} (n_i - 1)S_i^2$ and $N = \sum_{i=1}^{k} n_i$, is known
as the Bartlett test for the equality of variances (Hald, 1952, p. 290).]
4.6.3 Let $(X_1, \ldots, X_k)$ have a multinomial distribution $MN(n, \boldsymbol{\theta})$, where $\boldsymbol{\theta} =
(\theta_1, \ldots, \theta_k)$, $0 < \theta_i < 1$, $\sum_{i=1}^{k} \theta_i = 1$. Develop the likelihood ratio test of
$H_0: \theta_1 = \cdots = \theta_k = \frac{1}{k}$ against $H_1: \boldsymbol{\theta}$ arbitrary. Provide a large sample
approximation for the critical value.


4.6.4 Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be i.i.d. random vectors having a bivariate normal
distribution with zero means and covariance matrix
$\Sigma = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, $0 <
\sigma^2 < \infty$, $-1 < \rho < 1$. Develop the likelihood ratio test of $H_0: \rho = 0$, $\sigma$
arbitrary against $H_1: \rho \neq 0$, $\sigma$ arbitrary.


4.6.5 Let $(x_{11}, Y_{11}), \ldots, (x_{1n}, Y_{1n})$ and $(x_{21}, Y_{21}), \ldots, (x_{2n}, Y_{2n})$ be two sets of
independent normal regression points, i.e., $Y_{1j} \sim N(\alpha_1 + \beta_1 x_{1j}, \sigma^2)$, $j =
1, \ldots, n$ and $Y_{2j} \sim N(\alpha_2 + \beta_2 x_{2j}; \sigma^2)$, where $\mathbf{x}^{(1)} = (x_{11}, \ldots, x_{1n})'$ and $\mathbf{x}^{(2)} =
(x_{21}, \ldots, x_{2n})'$ are known constants.

(i) Construct the likelihood ratio test of $H_0: \alpha_1 = \alpha_2$, $\beta_1, \beta_2, \sigma$ arbitrary;
against $H_1: \alpha_1 \neq \alpha_2$; $\beta_1, \beta_2, \sigma$ arbitrary.

(ii) $H_0: \beta_1 = \beta_2$, $\alpha_1, \alpha_2$ arbitrary; against $H_1: \beta_1 \neq \beta_2$; $\alpha_1, \alpha_2, \sigma$ arbitrary.


4.6.6 The one-way analysis of variance (ANOVA) developed in Section 4.6
corresponds to model (4.6.31), which is labelled Model I. In this model, the
incremental effects are fixed. Consider now the random effects model of Example
3.9, which is labelled Model II. The analysis of variance tests $H_0: \tau^2 = 0$,
$\sigma^2$ arbitrary; against $H_1: \tau^2 > 0$, $\sigma^2$ arbitrary; where $\tau^2$ is the variance of the
random effects $a_1, \ldots, a_r$. Assume that all the samples are of equal size, i.e.,
$n_1 = \cdots = n_r = n$.

(i) Show that $S_p^2 = \frac{1}{r}\sum_{i=1}^{r} S_i^2$ and $S_b^2 = \frac{n}{r-1}\sum_{i=1}^{r} (\bar{X}_i - \bar{X})^2$ are independent.
(ii) Show that $S_b^2 \sim (\sigma^2 + n\tau^2)\chi^2[r - 1]/(r - 1)$.
(iii) Show that the F-ratio (4.6.27) is distributed like $(1 + n\frac{\tau^2}{\sigma^2})F[r - 1,
r(n - 1)]$.
(iv) What is the ANOVA test of $H_0$ against $H_1$?
(v) What is the power function of the ANOVA test? [Express this function
in terms of the incomplete beta function and compare the result with
(4.6.29)-(4.6.30).]


4.6.7 Consider the two-way layout model of ANOVA (4.6.33) in which the
incremental effects of A, $\tau_1^A, \ldots, \tau_{r_1}^A$, are considered fixed, but those of B,
$\tau_1^B, \ldots, \tau_{r_2}^B$, are considered i.i.d. random variables having a $N(0, \sigma_B^2)$
distribution. The interaction components $\tau_{ij}^{AB}$ are also considered i.i.d. (independent
of $\tau_j^B$) having a $N(0, \sigma_{AB}^2)$ distribution. The model is then called a mixed
effect model. Develop the ANOVA tests of the null hypotheses


What are the power functions of the various F-tests (see Scheffé, 1959,
Chapter 8)?


Section 4.7 


4.7.1 Apply the $X^2$-test to test the significance of the association between the
attributes A, B in the following contingency table

          A_1     A_2     A_3     Sum
   B_1    150     270     500     920
   B_2    550    1750     300    2600
   Sum    700    2020     800    3520

At what level of significance, $\alpha$, would you reject the hypothesis of no
association?
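A minimal sketch of the computation for this table, assuming the usual Pearson form $X^2 = \sum (f_{ij} - e_{ij})^2/e_{ij}$ with $e_{ij}$ equal to (row sum)(column sum)/(grand total); the function name is illustrative. For 2 degrees of freedom the upper tail of the chi-squared distribution is exactly $\exp(-x/2)$, so no statistical library is needed.

```python
import math

table = [[150, 270, 500],
         [550, 1750, 300]]

def pearson_chi_square(rows):
    """Pearson X^2 statistic and degrees of freedom for an r x c table."""
    row_sums = [sum(r) for r in rows]
    col_sums = [sum(c) for c in zip(*rows)]
    total = sum(row_sums)
    x2 = 0.0
    for i, r in enumerate(rows):
        for j, observed in enumerate(r):
            expected = row_sums[i] * col_sums[j] / total
            x2 += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_sums) - 1)
    return x2, df

x2, df = pearson_chi_square(table)
# For df = 2 the chi-squared upper tail probability is exactly exp(-x/2).
p_value = math.exp(-x2 / 2.0) if df == 2 else None
```

The attained significance level `p_value` is the smallest $\alpha$ at which the hypothesis of no association would be rejected.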


4.7.2 The $X^2$-test statistic (4.7.5) can be applied in large samples to test the
equality of the success probabilities of $k$ Bernoulli trials. More specifically, let
$f_1, \ldots, f_k$ be independent random variables having binomial distributions
$B(n_i, \theta_i)$, $i = 1, \ldots, k$. The hypothesis to test is $H_0: \theta_1 = \cdots = \theta_k = \theta$, $\theta$
arbitrary against $H_1$: the $\theta$s are not all equal. Notice that if $H_0$ is correct, then
$T = \sum_{i=1}^{k} f_i \sim B(N, \theta)$ where $N = \sum_{i=1}^{k} n_i$. Construct the $2 \times k$ contingency
table

            E_1         ...    E_k         Total
   S        f_1         ...    f_k         T
   F        n_1 - f_1   ...    n_k - f_k   N - T
   Total    n_1         ...    n_k         N

This is an example of a contingency table in which one margin is fixed
$(n_1, \ldots, n_k)$ and the cell frequencies do not follow a multinomial distribution.
The hypothesis $H_0$ is equivalent to the hypothesis that there is no association
between the trial number and the result (success or failure).
(i) Show that the $X^2$ statistic is equal in the present case to


(ii) Show that if $n_i \to \infty$ for all $i = 1, \ldots, k$ so that $\frac{n_i}{N} \to \lambda_i$, $0 < \lambda_i < 1$
for all $i = 1, \ldots, k$, then, under $H_0$, $X^2$ is asymptotically distributed like
$\chi^2[k - 1]$.


4.7.3 The test statistic $X^2$, as given by (4.7.5), can be applied to test also whether a
certain distribution $F_0(x)$ fits the frequency distribution of a certain random
variable. More specifically, let $Y$ be a random variable having a distribution
over $(a, b)$, where $a$ could assume the value $-\infty$ and/or $b$ could assume
the value $+\infty$. Let $\eta_0 < \eta_1 < \cdots < \eta_k$ with $\eta_0 = a$ and $\eta_k = b$, be a given
partition of $(a, b)$. Let $f_i$ $(i = 1, \ldots, k)$ be the observed frequency of $Y$ over
$(\eta_{i-1}, \eta_i)$ among $N$ i.i.d. observations on $Y_1, \ldots, Y_N$, i.e.,
$f_i = \sum_{j=1}^{N} I\{\eta_{i-1} < Y_j \leq \eta_i\}$, $i = 1, \ldots, k$. We wish to test the hypothesis
$H_0: F_Y(y) \equiv F_0(y)$, where $F_Y(y)$ denotes the c.d.f. of $Y$. We notice that if $H_0$ is true, then the
expected frequency of $Y$ at $[\eta_{i-1}, \eta_i]$ is $e_i = N\{F_0(\eta_i) - F_0(\eta_{i-1})\}$.
Accordingly, the test statistic $X^2$ assumes the form

$$X^2 = \sum_{i=1}^{k} f_i^2 / N[F_0(\eta_i) - F_0(\eta_{i-1})] - N.$$

The hypothesis $H_0$ is rejected, in large samples, at level of significance $\alpha$ if
$X^2 \geq \chi^2_{1-\alpha}[k - 1]$. This is a large sample test of goodness of fit, proposed in
1900 by Karl Pearson (see Lancaster, 1969, Chapter VIII; Bickel and Doksum,
1977, Chapter 8, for derivations and proofs concerning the asymptotic
distribution of $X^2$ under $H_0$).

The following 50 numbers are so-called "random numbers" generated by
a desk calculator: 0.9315, 0.2695, 0.3878, 0.9745, 0.9924, 0.7457, 0.8475,
0.6628, 0.8187, 0.8893, 0.8349, 0.7307, 0.0561, 0.2743, 0.0894, 0.8752,
0.6811, 0.2633, 0.2017, 0.9175, 0.9216, 0.6255, 0.4706, 0.6466, 0.1435,
0.3346, 0.8364, 0.3615, 0.1722, 0.2976, 0.7496, 0.2839, 0.4761, 0.9145,
0.2593, 0.6382, 0.2503, 0.3774, 0.2375, 0.8477, 0.8377, 0.5630, 0.2949,
0.6426, 0.9733, 0.4877, 0.4357, 0.6582, 0.6353, 0.2173. Partition the interval
(0, 1) into $k = 7$ equal length subintervals and apply the $X^2$ test statistic to test
whether the rectangular distribution $R(0, 1)$ fits the frequency distribution of
the above sample. [If any of the seven frequencies is smaller than six, combine
two adjacent subintervals until all frequencies are not smaller than six.]
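The following sketch carries out the computation for these 50 numbers. The merging rule is one possible reading of the bracketed instruction (scan the cells from left to right and pool a cell with its right neighbour until the pooled count reaches six); a different pooling order would give different cells. Function names are illustrative.

```python
data = [
    0.9315, 0.2695, 0.3878, 0.9745, 0.9924, 0.7457, 0.8475, 0.6628, 0.8187,
    0.8893, 0.8349, 0.7307, 0.0561, 0.2743, 0.0894, 0.8752, 0.6811, 0.2633,
    0.2017, 0.9175, 0.9216, 0.6255, 0.4706, 0.6466, 0.1435, 0.3346, 0.8364,
    0.3615, 0.1722, 0.2976, 0.7496, 0.2839, 0.4761, 0.9145, 0.2593, 0.6382,
    0.2503, 0.3774, 0.2375, 0.8477, 0.8377, 0.5630, 0.2949, 0.6426, 0.9733,
    0.4877, 0.4357, 0.6582, 0.6353, 0.2173,
]

def cell_counts(ys, k):
    """Frequencies over k equal-length cells of (0, 1).
    No observation here falls on a cell boundary, so the choice between
    half-open conventions does not matter for this sample."""
    counts = [0] * k
    for y in ys:
        counts[min(int(y * k), k - 1)] += 1
    return counts

def merge_small_cells(freqs, probs, min_count=6):
    """Left-to-right pooling of adjacent cells until each has >= min_count."""
    out_f, out_p = [], []
    acc_f, acc_p = 0, 0.0
    for f, p in zip(freqs, probs):
        acc_f += f
        acc_p += p
        if acc_f >= min_count:
            out_f.append(acc_f)
            out_p.append(acc_p)
            acc_f, acc_p = 0, 0.0
    if acc_p > 0.0 and out_f:  # a leftover tail joins the last kept cell
        out_f[-1] += acc_f
        out_p[-1] += acc_p
    return out_f, out_p

def x2_statistic(freqs, probs):
    """X^2 = sum f_i^2 / (N p_i) - N, with p_i the cell probability under H0."""
    n_obs = sum(freqs)
    return sum(f * f / (n_obs * p) for f, p in zip(freqs, probs)) - n_obs

k = 7
freqs = cell_counts(data, k)
merged_f, merged_p = merge_small_cells(freqs, [1.0 / k] * k)
x2 = x2_statistic(merged_f, merged_p)
df = len(merged_f) - 1
```

The value `x2` is then compared with $\chi^2_{1-\alpha}[\texttt{df}]$, where `df` is one less than the number of cells left after pooling.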
4.7.4 In continuation of the previous problem, if the hypothesis $H_0$ specifies a
distribution $F(x; \boldsymbol{\theta})$ that depends on a parameter $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_r)$, $1 \leq r$, but
the value of the parameter is unknown, the large sample test of goodness of fit
compares

$$X^2 = \sum_{i=1}^{k} f_i^2 / N[F(\eta_i; \hat{\boldsymbol{\theta}}) - F(\eta_{i-1}; \hat{\boldsymbol{\theta}})] - N$$

with $\chi^2_{1-\alpha}[k - 1 - r]$ (Lancaster, 1969, p. 148), where $\hat{\boldsymbol{\theta}}$ are estimates of $\boldsymbol{\theta}$
obtained by maximizing

$$Q = \sum_{i=1}^{k} f_i \log[F(\eta_i; \boldsymbol{\theta}) - F(\eta_{i-1}; \boldsymbol{\theta})].$$

(i) Suppose that $\eta_0 = 0 < \eta_1 < \cdots < \eta_k = \infty$ and $F(x; \sigma) = 1 -
\exp\{-x/\sigma\}$, $0 < \sigma < \infty$. Given $\eta_1, \ldots, \eta_{k-1}$ and $f_1, \ldots, f_k$, $N$, how
would you estimate $\sigma$?

(ii) What is the likelihood ratio statistic for testing $H_0$ against the alternative
that the distribution $F$ is arbitrary?

(iii) Under what conditions would the likelihood ratio statistic be
asymptotically equivalent, as $N \to \infty$, to $X^2$ (see Bickel and Doksum, 1977, p.
394)?
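One numerical way to approach part (i) is to maximize $Q$ over $\sigma$ directly. The sketch below uses a golden-section search on $\log \sigma$; it assumes that $Q$ is unimodal on the search interval, and the cut points and frequencies in the usage lines are made up for illustration.

```python
import math

def cell_probs(sigma, cuts):
    """Cell probabilities for F(x) = 1 - exp(-x/sigma), cuts = eta_1..eta_{k-1}.
    Computed from survival functions, with eta_0 = 0 and eta_k = infinity."""
    surv = [1.0] + [math.exp(-c / sigma) for c in cuts] + [0.0]
    return [surv[i] - surv[i + 1] for i in range(len(surv) - 1)]

def grouped_loglik(sigma, cuts, freqs):
    """Q(sigma) = sum f_i log p_i(sigma); tiny probabilities are floored."""
    return sum(f * math.log(max(p, 1e-300))
               for f, p in zip(freqs, cell_probs(sigma, cuts)) if f > 0)

def maximize_sigma(cuts, freqs, lo=1e-3, hi=1e3, iters=200):
    """Golden-section search on log(sigma); assumes Q is unimodal on [lo, hi]."""
    a, b = math.log(lo), math.log(hi)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if grouped_loglik(math.exp(c), cuts, freqs) > grouped_loglik(math.exp(d), cuts, freqs):
            b = d
        else:
            a = c
    return math.exp(0.5 * (a + b))

# Illustration: frequencies proportional to the exact cell probabilities
# at sigma = 2, so the maximizer should recover sigma = 2.
cuts = [1.0, 2.0, 4.0]
freqs = [1000.0 * p for p in cell_probs(2.0, cuts)]
sigma_hat = maximize_sigma(cuts, freqs)
```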


4.7.5 Consider Problem 3 of Section 2.9. Let $(X_{1i}, X_{2i})$, $i = 1, \ldots, n$ be a sample
of $n$ i.i.d. such vectors. Construct a test of $H_0: \rho = 0$ against $H_1: \rho \neq 0$, at
level of significance $\alpha$.


Section 4.8 


4.8.1 Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables having a common
binomial distribution $B(1, \theta)$, $0 < \theta < 1$.
(i) Construct the Wald SPRT for testing $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$,
$0 < \theta_0 < \theta_1 < 1$, aiming at error probabilities $\alpha$ and $\beta$, by applying the
approximation $A' = \log \beta/(1 - \alpha)$ and $B' = \log(1 - \beta)/\alpha$.
(ii) Compute and graph the OC curve for the case of $\theta_0 = .01$, $\theta_1 = .10$,
$\alpha = .05$, $\beta = .05$, using approximations (4.8.26)-(4.8.29).
(iii) What is $E_\theta\{N\}$ for $\theta = .08$?
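A minimal sketch of the stopping rule in part (i), assuming the standard form of the SPRT: sample while the cumulative log likelihood ratio stays strictly between $A'$ and $B'$, accept $H_0$ when it reaches $A'$ or below, and reject $H_0$ when it reaches $B'$ or above. The function names and return labels are illustrative.

```python
import math

def wald_bounds(alpha, beta):
    """Wald's approximate boundaries A' = log(beta/(1-alpha)), B' = log((1-beta)/alpha)."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

def bernoulli_sprt(xs, theta0, theta1, alpha, beta):
    """Run the SPRT on a 0/1 sequence; returns (decision, number of observations used)."""
    a_prime, b_prime = wald_bounds(alpha, beta)
    step_success = math.log(theta1 / theta0)              # increment when x = 1
    step_failure = math.log((1.0 - theta1) / (1.0 - theta0))  # increment when x = 0
    s = 0.0
    n = 0
    for n, x in enumerate(xs, start=1):
        s += step_success if x == 1 else step_failure
        if s >= b_prime:
            return "reject H0", n
        if s <= a_prime:
            return "accept H0", n
    return "continue sampling", n

# Design of part (ii): theta0 = .01, theta1 = .10, alpha = beta = .05.
all_failures = bernoulli_sprt([0] * 40, 0.01, 0.10, 0.05, 0.05)
two_successes = bernoulli_sprt([1, 1], 0.01, 0.10, 0.05, 0.05)
```

With this design each failure lowers the statistic by a small amount and each success raises it by $\log 10$, so a long run of failures is needed to accept $H_0$ while two early successes already cross the upper boundary.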


4.8.2 Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables having a $N(0, \sigma^2)$
distribution. Construct the Wald SPRT to test $H_0: \sigma^2 = 1$ against $H_1: \sigma^2 = 2$
with error probabilities $\alpha = .01$ and $\beta = .07$. What is $\pi(\sigma^2)$ and $E_{\sigma^2}\{N\}$ when
$\sigma^2 = 1.5$?
4.1.2 The sample size is $n = 100$.
(i) The large sample test is based on the normal approximation

$$\phi(X) = I\{X \leq 100 \times 0.75 - Z_{1-\alpha}\sqrt{100 \times 0.75 \times 0.25}\} = I\{X \leq 75 - Z_{1-\alpha}\, 4.33\}.$$

(ii) The power of the test when $\theta = 0.50$ is

$$\psi(0.5) = P_{0.5}(X \leq 75 - Z_{1-\alpha}\, 4.33) \doteq \Phi\left(\frac{25 - 4.33\, Z_{1-\alpha}}{\sqrt{100 \times .5 \times .5}}\right).$$

(iii) We have to satisfy two equations:

(i) $\Phi\left(\dfrac{C - n \times 0.75}{\sqrt{n \times 0.75 \times 0.25}}\right) \leq 0.05$

(ii) $\Phi\left(\dfrac{C - n \times 0.6}{\sqrt{n \times 0.6 \times 0.4}}\right) \geq 0.9$.

The solution of these 2 equations is $n = 82$, $C = 55$. For these values of
$n$ and $C$ we have $\alpha = 0.066$ and $\psi(0.6) = 0.924$.
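The pair $(n, C)$ can be checked by a brute-force search over the two normal-approximation inequalities. The sketch below is an illustration only: it uses the plain normal approximation without a continuity correction and looks for the smallest $n$ that admits an integer $C$; the function names are hypothetical.

```python
import math

def phi_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def smallest_design(theta0=0.75, theta1=0.6, alpha=0.05, power=0.9, n_max=500):
    """Smallest n (and its integer C) with
    Phi((C - n*theta0)/sqrt(n*theta0*(1-theta0))) <= alpha and
    Phi((C - n*theta1)/sqrt(n*theta1*(1-theta1))) >= power."""
    for n in range(1, n_max + 1):
        for c in range(n + 1):
            size = phi_cdf((c - n * theta0) / math.sqrt(n * theta0 * (1 - theta0)))
            pw = phi_cdf((c - n * theta1) / math.sqrt(n * theta1 * (1 - theta1)))
            if size <= alpha and pw >= power:
                return n, c
    return None

design = smallest_design()
```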


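4.2.1 Assume that $\nu = 2$. The m.s.s. is $T = \sum_{i=1}^{n} X_i \sim NB(\psi, 2n)$.

(i) This is an MLR family. Hence, the MP test is

$$\phi^0(T) = \begin{cases} 1, & \text{if } T > NB^{-1}(1 - \alpha; \psi_0, 2n) \\ \gamma_\alpha, & \text{if } T = NB^{-1}(1 - \alpha; \psi_0, 2n) \\ 0, & \text{if } T < NB^{-1}(1 - \alpha; \psi_0, 2n). \end{cases}$$

Let $T_{1-\alpha}(\psi_0) = NB^{-1}(1 - \alpha; \psi_0, 2n)$. Then

$$\gamma_\alpha = \frac{1 - \alpha - P\{T \leq T_{1-\alpha}(\psi_0) - 1\}}{P\{T = T_{1-\alpha}(\psi_0)\}}.$$

(ii) The power function is

$$\text{Power}(\psi) = P_\psi\{T > T_{1-\alpha}(\psi_0)\} + \gamma_\alpha P_\psi\{T = T_{1-\alpha}(\psi_0)\}.$$

We can compute these functions with the formula

$$P_\psi\{T \leq t\} = I_{1-\psi}(2n, t + 1).$$

(iii) We can start with the large sample normal approximation

$$P\{T \leq t\} \doteq \Phi\left(\frac{t - 2n\psi/(1 - \psi)}{\sqrt{2n\psi}/(1 - \psi)}\right) = \Phi\left(\frac{(1 - \psi)t - 2n\psi}{\sqrt{2n\psi}}\right).$$

For $\psi_0 = 0.05$, $\alpha = 0.10$, $\psi_1 = 0.15$, $1 - \beta = 0.8$, we get the
equations

(i) $0.95t - 2n \times 0.05 = 1.2816\sqrt{2n \times 0.05}$.
(ii) $0.85t - 2n \times 0.15 = -0.8416\sqrt{2n \times 0.15}$.

The solution of these equations is $n = 15.303$ and $t = 3.39$. Since
$n$ and $t$ are integers we take $n = 16$ and $t = 3$. Indeed, $P_{0.05}(T \leq 3) =
I_{.95}(32, 4) = 0.904$. Thus,

$$\phi(T) = \begin{cases} 1, & \text{if } T > 3 \\ 0.972, & \text{if } T = 3 \\ 0, & \text{if } T < 3. \end{cases}$$

The power at $\psi = 0.15$ is

$$\text{Power}(0.15) = P_{0.15}\{T > 3\} + 0.972\, P_{0.15}\{T = 3\} = 0.8994.$$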
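The two numerical values quoted above can be reproduced by summing the negative binomial probabilities directly. The sketch assumes the parametrization $P\{T = t\} = \binom{\nu + t - 1}{t}\psi^t(1 - \psi)^\nu$, which agrees with the relation $P_\psi\{T \leq t\} = I_{1-\psi}(\nu, t + 1)$; the function names are illustrative.

```python
import math

def nb_pmf(t, psi, nu):
    """P{T = t} = C(nu + t - 1, t) * psi^t * (1 - psi)^nu, t = 0, 1, ..."""
    return math.comb(nu + t - 1, t) * psi ** t * (1.0 - psi) ** nu

def nb_cdf(t, psi, nu):
    """P{T <= t} by direct summation."""
    return sum(nb_pmf(j, psi, nu) for j in range(t + 1))

n, t_crit, gamma = 16, 3, 0.972   # values quoted in the solution
cdf_null = nb_cdf(t_crit, 0.05, 2 * n)
power_alt = (1.0 - nb_cdf(t_crit, 0.15, 2 * n)) + gamma * nb_pmf(t_crit, 0.15, 2 * n)
```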
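For $\psi_1 > \psi_0$,

$$\frac{f(X; \psi_1)}{f(X; \psi_0)} = \exp\{(\psi_1 - \psi_0)X - (K(\psi_1) - K(\psi_0))\},$$

$$E_i\left\{\log \frac{f(X; \psi_1)}{f(X; \psi_0)}\right\} = E_i\{(\psi_1 - \psi_0)X\} - (K(\psi_1) - K(\psi_0)) = (\psi_1 - \psi_0)K'(\psi_i) - (K(\psi_1) - K(\psi_0)),$$

since $E_i\{X\} = K'(\psi_i)$, $i = 0, 1$.

(ii)

$$V_i\left\{\log \frac{f(X; \psi_1)}{f(X; \psi_0)}\right\} = (\psi_1 - \psi_0)^2 V_i\{X\} = (\psi_1 - \psi_0)^2 K''(\psi_i), \quad i = 0, 1.$$

(iii) The MP test of $H_0$ versus $H_1$ is asymptotically (large $n$)

$$\phi(Z_n) = I\Big\{Z_n \geq (\psi_1 - \psi_0)K'(\psi_0) - (K(\psi_1) - K(\psi_0)) + \frac{1}{\sqrt{n}} Z_{1-\alpha}(\psi_1 - \psi_0)(K''(\psi_0))^{1/2}\Big\}.$$

(iv) Writing $l_\alpha$ for the critical value in (iii),

$$\text{Power}(\psi_1) = P_{\psi_1}\{Z_n \geq l_\alpha\} \doteq 1 - \Phi\left(\frac{\sqrt{n}(\psi_1 - \psi_0)(K'(\psi_0) - K'(\psi_1)) + Z_{1-\alpha}(\psi_1 - \psi_0)\sqrt{K''(\psi_0)}}{(\psi_1 - \psi_0)\sqrt{K''(\psi_1)}}\right)$$

$$= \Phi\left(\sqrt{n}\,\frac{K'(\psi_1) - K'(\psi_0)}{\sqrt{K''(\psi_1)}} - Z_{1-\alpha}\sqrt{\frac{K''(\psi_0)}{K''(\psi_1)}}\right).$$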


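4.3.1 (i) Since $Q'(\theta) > 0$, if $\theta_1 < \theta_2$ then $Q(\theta_2) > Q(\theta_1)$, and

$$\frac{f(X; \theta_2)}{f(X; \theta_1)} = \exp\{(Q(\theta_2) - Q(\theta_1))U(X) + C(\theta_2) - C(\theta_1)\}.$$

$\mathcal{F} = \{f(X; \theta), \theta \in \Theta\}$ is MLR in $U(X)$ since $\exp\{(Q(\theta_2) -
Q(\theta_1))U(X)\} \nearrow U(X)$.

(ii) Let $T(\mathbf{X}) = \sum_{j=1}^{n} U(X_j)$. The joint p.d.f. is

$$f(\mathbf{x}; \theta) = \prod_{j=1}^{n} h(x_j) \cdot \exp\Big\{Q(\theta)\sum_{j=1}^{n} U(x_j) + nC(\theta)\Big\}.$$

Thus, the p.d.f. of $T(\mathbf{X})$, with respect to a measure $\lambda$, is

$$f(t; \theta) = \exp(Q(\theta)t + nC(\theta)) = \frac{\exp(Q(\theta)t)}{\int \exp(Q(\theta)t)\,d\lambda(t)} = \exp(Q(\theta)t - K(\theta)),$$

where

$$K(\theta) = \log\Big(\int \exp\{Q(\theta)t\}\,d\lambda(t)\Big).$$

This is a one-parameter exponential type density.

(iii) The UMP test of $H_0: \theta \leq \theta_0$ versus $H_1: \theta > \theta_0$ is

$$\phi^0(T) = \begin{cases} 1, & \text{if } T > C_\alpha(\theta_0) \\ \gamma_\alpha, & \text{if } T = C_\alpha(\theta_0) \\ 0, & \text{if } T < C_\alpha(\theta_0), \end{cases}$$

where $C_\alpha(\theta_0)$ is the $(1 - \alpha)$ quantile of $T$ under $\theta_0$. If $T$ is discrete then

$$\gamma_\alpha = \frac{1 - \alpha - P_{\theta_0}\{T \leq C_\alpha(\theta_0) - 0\}}{P_{\theta_0}\{T = C_\alpha(\theta_0)\}}.$$

(iv)

$$\text{Power}(\theta) = E_\theta\{\phi^0(T)\} = \int_{C_\alpha(\theta_0)+}^{\infty} \exp\{Q(\theta)t - K(\theta)\}\,d\lambda(t) + \gamma_\alpha P_\theta\{T = C_\alpha(\theta_0)\}.$$

By Karlin's Lemma, this function is increasing in $\theta$. It is differentiable
w.r.t. $\theta$.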

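4.3.3 (i) $T_1, T_2, \ldots, T_n$ are i.i.d. $\theta G(1, 1) \sim \frac{\theta}{2}\chi^2[2]$. Define the variables
$Y_1, \ldots, Y_r$, where for $T_{(1)} < \cdots < T_{(r)}$

$$Y_i = T_{(i)}, \quad i = 1, \ldots, r.$$

Then

$$Y_1 \sim \frac{\theta}{n}G(1, 1), \quad (Y_2 - Y_1) \sim \frac{\theta}{n - 1}G(1, 1), \quad \ldots, \quad (Y_r - Y_{r-1}) \sim \frac{\theta}{n - r + 1}G(1, 1)$$

are independent increments. Accordingly,

$$nY_1 + (n - 1)(Y_2 - Y_1) + \cdots + (n - r + 1)(Y_r - Y_{r-1}) \sim \frac{\theta}{2}\chi^2[2r].$$

The left-hand side is equal to $\sum_{i=1}^{r} Y_i + (n - r)Y_r$. Thus $T_{n,r} \sim \frac{\theta}{2}\chi^2[2r]$.

(ii) Since the distribution of $T_{n,r}$ is MLR in $T_{n,r}$, the UMP test of size $\alpha$ is

$$\phi^0(T_{n,r}) = I\Big\{T_{n,r} \geq \frac{\theta_0}{2}\chi^2_{1-\alpha}[2r]\Big\}.$$

(iii) The power function is

$$\psi(\theta) = P\Big\{\frac{\theta}{2}\chi^2[2r] \geq \frac{\theta_0}{2}\chi^2_{1-\alpha}[2r]\Big\} = P\Big\{\chi^2[2r] \geq \frac{\theta_0}{\theta}\chi^2_{1-\alpha}[2r]\Big\}.$$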
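Because the degrees of freedom $2r$ are even, the power function can be evaluated with elementary functions: $P\{\chi^2[2r] \geq x\} = e^{-x/2}\sum_{j=0}^{r-1}(x/2)^j/j!$. The sketch below uses this identity and a bisection for the quantile; the function names and the numerical settings are assumptions for illustration.

```python
import math

def chi2_sf_even(x, r):
    """P{chi^2[2r] >= x} = exp(-x/2) * sum_{j<r} (x/2)^j / j!  (even d.f. only)."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(r))

def chi2_quantile_even(p, r, hi=1e4, tol=1e-10):
    """p-th quantile of chi^2[2r], by bisection on the survival function."""
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_sf_even(mid, r) > 1.0 - p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ump_power(theta, theta0, r, alpha):
    """psi(theta) = P{chi^2[2r] >= (theta0/theta) * chi^2_{1-alpha}[2r]}."""
    crit = chi2_quantile_even(1.0 - alpha, r)
    return chi2_sf_even(theta0 / theta * crit, r)

# Illustrative settings: r = 5 failures, alpha = 0.05, theta0 = 1.
power_values = [ump_power(th, 1.0, 5, 0.05) for th in (1.0, 1.5, 2.0, 3.0)]
```

The first entry of `power_values` equals the size $\alpha$, and the entries increase with $\theta$, as expected for a UMP test in an MLR family.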


4.4.2 $X \sim B(20, \theta)$. $H_0: \theta = .15$, $H_1: \theta \neq .15$, $\alpha = .05$. The UMPU test is

$$\phi^0(X) = \begin{cases} 1, & X > C_2(\theta_0) \\ \gamma_2, & X = C_2(\theta_0) \\ 0, & C_1(\theta_0) < X < C_2(\theta_0) \\ \gamma_1, & X = C_1(\theta_0) \\ 1, & X < C_1(\theta_0). \end{cases}$$

The test should satisfy the conditions
(i) $E_{.15}\{\phi^0(X)\} = \alpha = 0.05$.
(ii) $E_{.15}\{X\phi^0(X)\} = \alpha E_{.15}\{X\} = .15$.

$$X\binom{20}{X}\theta_0^X(1 - \theta_0)^{20-X} = \frac{20!}{(X - 1)!(20 - X)!}\theta_0^X(1 - \theta_0)^{20-X} = 20\theta_0\binom{19}{X - 1}\theta_0^{X-1}(1 - \theta_0)^{20-X} = 3b(X - 1; 19, \theta_0).$$

Thus, we find $C_1, \gamma_1, C_2, \gamma_2$ from the equations

(i) $\sum_{j < C_1} b(j; 20, .15) + \gamma_1 b(C_1; 20, .15) + \gamma_2 b(C_2; 20, .15) + \sum_{j > C_2} b(j; 20, .15) = .05$.

(ii) $\sum_{j < C_1} b(j; 19, .15) + \gamma_1 b(C_1; 19, .15) + \gamma_2 b(C_2; 19, .15) + \sum_{j > C_2} b(j; 19, .15) = 0.15$.

We start with $C_1 = B^{-1}(.025; 20, .15) = 0$, $C_2 = B^{-1}(.975; 20, .15) = 6$:

$B(0; 20, .15) = 0.03875$, $\gamma_1 = \dfrac{.025}{0.03875} = .6452$,
$B(6; 20, .15) = .9781$, $B(5; 20, .15) = .9327$, $\gamma_2 = \dfrac{.975 - .9327}{.9781 - .9327} = .932$.

These satisfy (i). We check now (ii).

$$\gamma_1 B(0; 19, .15) + \gamma_2 b(6; 19, .15) + (1 - B(6; 19, .15)) = 0.080576.$$

We keep $C_1$, $\gamma_1$, and $C_2$ and change $\gamma_2$ to satisfy (ii); we get
$\gamma_2' = 0.1137138$.

We go back to (i), with $\gamma_2'$, and recompute $\gamma_1$. We obtain $\gamma_1' = .59096$.
With these values of $\gamma_1'$ and $\gamma_2'$, Equation (i) yields 0.049996 and
Equation (ii) yields 0.0475. This is close enough. The UMPU test is

$$\phi^0(X) = \begin{cases} 1, & \text{if } X > 6 \\ 0.1137, & \text{if } X = 6 \\ 0, & \text{if } 0 < X < 6 \\ 0.5909, & \text{if } X = 0. \end{cases}$$
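The power values requested in the problem follow from $\text{Power}(\theta) = \sum_x \phi^0(x)\, b(x; 20, \theta)$. The sketch below evaluates this sum for the randomized test quoted in the solution, taking the randomization constants 0.59096 and 0.1137138 as given; the function names are illustrative.

```python
import math

def binom_pmf(x, n, theta):
    """b(x; n, theta)."""
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)

def phi0(x):
    """Rejection probability of the test quoted in the solution:
    reject for X > 6, randomize at X = 6 and at X = 0."""
    if x > 6:
        return 1.0
    if x == 6:
        return 0.1137138
    if x == 0:
        return 0.59096
    return 0.0

def power(theta, n=20):
    """E_theta{phi0(X)} for X ~ B(n, theta)."""
    return sum(phi0(x) * binom_pmf(x, n, theta) for x in range(n + 1))

powers = {theta: power(theta) for theta in (0.05, 0.15, 0.20, 0.25)}
```

At $\theta = .15$ the value reproduces the size condition (i); the other entries give the power at the alternatives listed in the problem.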


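4.5.1 $X_1, \ldots, X_n$ are i.i.d. $\sim \mu + \sigma G(1, 1)$, $-\infty < \mu < \infty$, $0 < \sigma < \infty$. The
m.s.s. is $(nX_{(1)}, U)$, where

$$U = \sum_{i=2}^{n} (X_{(i)} - X_{(1)}) \sim \sigma G(1, n - 1),$$

$$nX_{(1)} \sim n\mu + \sigma G(1, 1),$$

and $X_{(1)}$ is independent of $U$.

(i) Find the UMPU test of size $\alpha$ of

$H_0: \mu \leq \mu_0$, $\sigma$ arbitrary
against
$H_1: \mu > \mu_0$, $\sigma$ arbitrary.

The m.s.s. for $\mathcal{F}^*$ is $T = \sum_{i=1}^{n} X_i = U + nX_{(1)}$. The conditional
distribution of $nX_{(1)}$ given $T$ is found in the following way.

(ii) $T \sim \mu + \sigma G(1, n)$. That is,

$$f_T(t) = \frac{1}{\sigma\Gamma(n)}\Big(\frac{t - \mu}{\sigma}\Big)^{n-1}\exp\Big(-\frac{t - \mu}{\sigma}\Big), \quad t \geq \mu.$$

The joint p.d.f. of $nX_{(1)}$ and $U$ is

$$f_{nX_{(1)},U}(y, u) = \frac{1}{\sigma}\exp\Big(-\frac{y - \mu}{\sigma}\Big) \cdot \frac{1}{\Gamma(n - 1)\sigma^{n-1}}u^{n-2}e^{-u/\sigma}\, I(y \geq \mu)$$

$$= \frac{1}{\sigma^n\Gamma(n - 1)}u^{n-2}\exp\Big(-\frac{y - \mu}{\sigma} - \frac{u}{\sigma}\Big) I(y \geq \mu).$$

We make the transformation

$$y = y, \quad t = y + u; \qquad y = y, \quad u = t - y.$$

The joint density of $(nX_{(1)}, T)$ is

$$g_{nX_{(1)},T}(y, t) = \frac{1}{\sigma^n\Gamma(n - 1)}(t - y)^{n-2}e^{-(t-\mu)/\sigma}\, I(t \geq y \geq \mu).$$

Thus, the conditional density of $nX_{(1)}$, given $T$, is

$$h_{nX_{(1)}|T}(y \mid t) = \frac{g_{nX_{(1)},T}(y, t)}{f_T(t)} = \frac{(n - 1)(t - y)^{n-2}}{(t - \mu)^{n-1}}\, I(t > y > \mu)$$

$$= \frac{n - 1}{t - \mu}\Big(1 - \frac{y - \mu}{t - \mu}\Big)^{n-2}, \quad \mu < y < t.$$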


The c.d.f. of nX (1) | T, at u = po is 


n—-1 

— Ho 

fay |) = (1-282 ) , po<y<t. 
t— Lo 
The $(1 - \alpha)$-quantile of $H_0(y \mid t)$ is the solution $y$ of
n—-1 
ie —- = 1—a, which is C,(uo, t) = wo + (t — wo) — 
—u 


0 
(1 — a)!/@-)), The UMPU test of size a is 


1, ifnXa) > Co(uo, T) 
0, otherwise. 


P'(nXqy | T)= | 
The power function of this test is 


ww) = Pi{nXy = Co(uo, T)} 


n—-1 
~1-5,{(1- aee7i=2) | 
—Hw 


T —w~ oG(1,n). Hence, for é = 1—(1—a)/@-, 


ww) =1- ail (1 gp", Lab tetas 
Pn) Jo oO y 


=1—(1—a)e’P(n — 1;5), 


where 6 = (44 — o)/o). This is a continuous increasing function of jw. 
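(iii) Testing

$H_0: \sigma = \sigma_0$, $\mu$ arbitrary

$H_1: \sigma \neq \sigma_0$, $\mu$ arbitrary.

The m.s.s. for $\mathcal{F}^*$ is $X_{(1)}$. Since $U$ is independent of $X_{(1)}$, the conditional
test is based on $U$ only. That is,

$$\phi^0(U) = \begin{cases} 1, & U \geq C_2(\sigma_0) \\ 0, & C_1 < U < C_2 \\ 1, & U \leq C_1(\sigma_0). \end{cases}$$

We can start with $C_1(\sigma_0) = \frac{\sigma_0}{2}\chi^2_{\alpha/2}[2n - 2]$ and $C_2(\sigma_0) = \frac{\sigma_0}{2}\chi^2_{1-\alpha/2}[2n - 2]$.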
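4.5.4 (i) $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$. The m.s.s.
is $(\bar{X}, Q)$, where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $Q = \sum_{i=1}^{n}(X_i - \bar{X})^2$. $\bar{X}$ and $Q$ are
independent. We have to construct a test of

$H_0: \mu + 2\sigma \geq 0$
against
$H_1: \mu + 2\sigma < 0$.

Obviously, if $\mu \geq 0$ then $H_0$ is true. A test of $H_0^*: \mu \geq 0$ versus $H_1^*:
\mu < 0$ is

$$\phi(\bar{X}, Q) = \begin{cases} 1, & \text{if } \dfrac{\sqrt{n}\,\bar{X}}{S} \leq -t_{1-\alpha}[n - 1] \\ 0, & \text{otherwise}, \end{cases}$$

where $S^2 = Q/(n - 1)$ is the sample variance. We should have a more
stringent test. Consider the statistic $\dfrac{n\bar{X}^2}{S^2} \sim F[1, n - 1; \lambda]$ where $\lambda =
\dfrac{n\mu^2}{2\sigma^2}$. Notice that for $\mathcal{F}^*$, $\mu^2 = 4\sigma^2$ or $\lambda_0 = 2n$. Let $\boldsymbol{\theta} = (\mu, \sigma)$. If
$\boldsymbol{\theta} \in \Theta_1$, then $\lambda > 2n$ and if $\boldsymbol{\theta} \in \Theta_0$, $\lambda \leq 2n$. Thus, the suggested test is

$$\phi(\bar{X}, S) = \begin{cases} 1, & \text{if } \dfrac{n\bar{X}^2}{S^2} \geq F_{1-\alpha}[1, n - 1; 2n] \\ 0, & \text{otherwise}. \end{cases}$$

(ii) The power function is

$$\psi(\lambda) = P\{F[1, n - 1; \lambda] \geq F_{1-\alpha}[1, n - 1; 2n]\}.$$

The distribution of a noncentral $F[1, n - 1; \lambda]$ is like that of $(1 +
2J)F[1 + 2J, n - 1]$, where $J \sim \text{Pois}(\lambda)$ (see Section 2.8). Thus, the
power function is

$$\psi(\lambda) = E_\lambda\{P\{(1 + 2J)F[1 + 2J, n - 1] \geq F_{1-\alpha}[1, n - 1; 2n] \mid J\}\}.$$

The conditional probability is an increasing function of $J$. Thus, by
Karlin's Lemma, $\psi(\lambda)$ is an increasing function of $\lambda$. Notice that

$$P\{F[1, n - 1; \lambda] \leq x\} = e^{-\lambda}\sum_{j=0}^{\infty}\frac{\lambda^j}{j!}P\{(1 + 2j)F[1 + 2j, n - 1] \leq x\}.$$

Thus, $F_{1-\alpha}[1, n - 1; \lambda] \nearrow \infty$.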
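4.6.2 The GLR statistic is

$$\Lambda(\mathbf{X}) = \frac{\sup_{0 < \sigma^2 < \infty,\ \mu_1, \ldots, \mu_k} \prod_{i=1}^{k} f(\mathbf{x}_i; \mu_i, \sigma^2)}{\sup_{\sigma_1^2, \ldots, \sigma_k^2,\ \mu_1, \ldots, \mu_k} \prod_{i=1}^{k} f(\mathbf{x}_i; \mu_i, \sigma_i^2)} = \frac{\prod_{i=1}^{k}(Q_i/n_i)^{n_i/2}}{\big(\sum_{i=1}^{k} Q_i/N\big)^{N/2}},$$

where $N = \sum_{i=1}^{k} n_i$, $Q_i = (n_i - 1)S_i^2$, $S_p^2 = \sum_{i=1}^{k} Q_i/(N - k)$. Hence

$$-2\log\Lambda(\mathbf{X}) = N\log\Big(\frac{(N - k)S_p^2}{N}\Big) - \sum_{i=1}^{k} n_i\log\Big(\frac{(n_i - 1)S_i^2}{n_i}\Big)$$

$$= \sum_{i=1}^{k} n_i\log\frac{S_p^2}{S_i^2} + \sum_{i=1}^{k} n_i\log\Big(\frac{1 - k/N}{1 - 1/n_i}\Big).$$

Notice that if $n_1 = \cdots = n_k = n$, then $\sum_{i=1}^{k} n\log\dfrac{1 - k/N}{1 - 1/n} = 0$. Thus, for
large samples, as $n \to \infty$, the Bartlett test is: reject $H_0$ if
$\sum_{i=1}^{k} n_i\log\dfrac{S_p^2}{S_i^2} \geq \chi^2_{1-\alpha}[k - 1]$. We have $k - 1$ degrees of freedom since $\sigma^2$ is unknown.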


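$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left(\mathbf{0}, \sigma^2\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right).$$

Let $Q = \sum_{i=1}^{n}(X_i^2 + Y_i^2)$ and $P = \sum_{i=1}^{n} X_iY_i$. $H_0: \rho = 0$, $\sigma^2$ arbitrary; $H_1:
\rho \neq 0$, $\sigma^2$ arbitrary. The MLE of $\sigma^2$ under $H_0$ is $\hat{\sigma}^2 = \dfrac{Q}{2n}$. The likelihood
of $(\rho, \sigma^2)$ is

$$L(\rho, \sigma^2) = \frac{1}{(\sigma^2)^n(1 - \rho^2)^{n/2}}\exp\Big\{-\frac{1}{2\sigma^2(1 - \rho^2)}(Q - 2\rho P)\Big\}.$$

Thus, $\sup_{0 < \sigma^2 < \infty} L(0, \sigma^2) = \dfrac{1}{(Q/2n)^n}\exp(-n)$. We find now $\hat{\sigma}^2$ and $\hat{\rho}$ that
maximize $L(\rho, \sigma^2)$. Let $l(\rho, \sigma^2) = \log L(\rho, \sigma^2)$. Then

$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{\sigma^2} + \frac{Q - 2\rho P}{2\sigma^4(1 - \rho^2)},$$

$$\frac{\partial l}{\partial\rho} = \frac{n\rho}{1 - \rho^2} - \frac{\rho(Q - 2\rho P) - P(1 - \rho^2)}{(1 - \rho^2)^2\sigma^2}.$$

Equating these partial derivatives to zero, we get the two equations

(i) $\dfrac{Q - 2\rho P}{1 - \rho^2} = 2n\sigma^2$,

(ii) $n\rho\sigma^2(1 - \rho^2) = \rho(Q - 2\rho P) - P(1 - \rho^2)$.

Solution of these equations gives the roots $\hat{\sigma}^2 = \dfrac{Q}{2n}$ and $\hat{\rho} = \dfrac{2P}{Q}$. Thus,

$$\sup_{\rho, \sigma^2} L(\rho, \sigma^2) = \frac{1}{(Q/2n)^n\big(1 - (2P/Q)^2\big)^{n/2}}\exp(-n).$$

It follows that

$$\Lambda(\mathbf{X}, \mathbf{Y}) = \Big(1 - \Big(\frac{2P}{Q}\Big)^2\Big)^{n/2}.$$

$\Lambda(\mathbf{X}, \mathbf{Y})$ is small if $\Big(\dfrac{2P}{Q}\Big)^2$ is large, that is, if $P$ is far from zero relative to $Q$.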


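4.6.6 Model II of ANOVA is

$$X_{ij} = \mu + a_i + e_{ij}, \quad i = 1, \ldots, r; \; j = 1, \ldots, n,$$

where $e_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$, $a_i \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)$, $\{e_{ij}\}$ independent of $\{a_i\}$. Let

$$S_i^2 = \sum_{j=1}^{n}(X_{ij} - \bar{X}_i)^2/(n - 1).$$

Notice that $X_{ij} \mid a_i \sim N(\mu + a_i, \sigma^2)$. Hence, $X_{ij} \sim N(\mu, \sigma^2 + \tau^2)$. Moreover,

$$S_i^2 \mid a_i \sim \sigma^2\chi^2[n - 1]/(n - 1), \quad i = 1, \ldots, r$$

and

$$S_p^2 = \frac{1}{r}\sum_{i=1}^{r} S_i^2 \sim \sigma^2\chi^2[r(n - 1)]/r(n - 1).$$

(i) $S_b^2 = \dfrac{n}{r - 1}\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2$, where $\bar{X} = \dfrac{1}{r}\sum_{i=1}^{r}\bar{X}_i$. Given $a_i$, $\bar{X}_i$ and $S_i^2$
are conditionally independent. Since the distribution of $S_i^2$ does not
depend on $a_i$, $\bar{X}_i$ and $S_i^2$ are independent for all $i = 1, \ldots, r$. Hence
$S_p^2$ is independent of $S_b^2$.

(ii) $\bar{X}_i \sim N\Big(\mu, \tau^2 + \dfrac{\sigma^2}{n}\Big)$ for all $i = 1, \ldots, r$. Hence,

$$\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2 \sim \Big(\tau^2 + \frac{\sigma^2}{n}\Big)\chi^2[r - 1].$$

Thus,

$$S_b^2 = \frac{n}{r - 1}\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2 \sim (\sigma^2 + n\tau^2)\chi^2[r - 1]/(r - 1).$$

(iii)
$$F = \frac{S_b^2}{S_p^2} \sim \frac{\sigma^2 + n\tau^2}{\sigma^2} \cdot \frac{\chi_1^2[r - 1]/(r - 1)}{\chi_2^2[r(n - 1)]/r(n - 1)} \sim \Big(1 + n\frac{\tau^2}{\sigma^2}\Big)F[r - 1, r(n - 1)].$$

(iv) The ANOVA test of $H_0: \tau^2 = 0$, $\sigma$ arbitrary against $H_1: \tau^2 > 0$, $\sigma$
arbitrary is the F test

$$\phi(F) = I\{F \geq F_{1-\alpha}[r - 1, r(n - 1)]\}.$$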


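(v) The power function is, by (iii) and (iv),

$$\psi(\tau^2/\sigma^2) = P\Big\{F[r - 1, r(n - 1)] \geq \frac{F_{1-\alpha}[r - 1, r(n - 1)]}{1 + n\tau^2/\sigma^2}\Big\}.$$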
CHAPTER 5


Statistical Estimation 


PART I: THEORY 
5.1 GENERAL DISCUSSION 


Point estimators are sample statistics that are designed to yield numerical estimates of 
certain characteristics of interest of the parent distribution. While in testing hypotheses 
we are generally interested in drawing general conclusions about the characteristics 
of the distribution, for example, whether its expected value (mean) is positive or 
negative, in problems of estimation we are concerned with the actual value of the 
characteristic. Generally, we can formulate, as in testing of hypotheses, a statisti- 
cal model that expresses the available information concerning the type of distribu- 
tion under consideration. In this connection, we distinguish between parametric and 
nonparametric (or distribution free) models. Parametric models specify parametric 
families of distributions. It is assumed in these cases that the observations in the 
sample are generated from a parent distribution that belongs to the prescribed family. 
The estimators that are applied in parametric models depend in their structure and 
properties on the specific parametric family under consideration. On the other hand, 
if we do not wish, for various reasons, to subject the estimation procedure to strong 
assumptions concerning the family to which the parent distribution belongs, a distri- 
bution free procedure may be more reasonable. In Example 5.1, we illustrate some 
of these ideas. 

This chapter is devoted to the theory and applications of these types of estimators: 
unbiased, maximum likelihood, equivariant, moment equations, pretest, and robust 
estimators. 
5.2 UNBIASED ESTIMATORS


5.2.1 General Definition and Example 


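An unbiased estimator of a characteristic $\theta(F)$ of $F$ in $\mathcal{F}$ is an estimator $\hat{\theta}(\mathbf{X})$
satisfying

$$E_F\{\hat{\theta}(\mathbf{X})\} = \theta(F), \quad \text{for all } F \in \mathcal{F}, \qquad (5.2.1)$$

where $\mathbf{X}$ is a random vector representing the sample random variables. For example,
if $\theta(F) = E_F\{X\}$, assuming that $E_F\{|X|\} < \infty$ for all $F \in \mathcal{F}$, then the sample mean
$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is an unbiased estimator of $\theta(F)$. Moreover, if $V_F\{X\} < \infty$ for all
$F \in \mathcal{F}$, then the sample variance $S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ is an unbiased estimator
of $V_F\{X\}$. We note that all the examples of unbiased estimators given here are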
distribution free. They are valid for any distribution for which the expectation or the 
variance exist. For parametric models one can do better by using unbiased estimators 
which are functions of the minimal sufficient statistics. The comparison of unbiased 
estimators is in terms of their variances. Of two unbiased estimators, the one having 
a smaller variance is considered better, or more efficient. One reason for preferring 
the unbiased estimator with the smaller variance is in the connection between the 
variance of the estimator and the probability that it belongs to a fixed-width interval 
centered at the unknown characteristic. In Example 5.2, we illustrate a case in which 
the distribution-free estimator of the expectation is inefficient. 
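The unbiasedness of the sample mean and the sample variance can be confirmed exactly for a small discrete distribution by enumerating all possible samples. The three-point distribution and the sample size below are hypothetical choices made only for this check; exact rational arithmetic avoids rounding error.

```python
from fractions import Fraction
from itertools import product

support = [0, 1, 4]                                   # hypothetical support points
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = 3                                                 # sample size

# Population mean and variance of the assumed distribution.
mean = sum(p * x for x, p in zip(support, probs))
var = sum(p * (x - mean) ** 2 for x, p in zip(support, probs))

# Expectations of the sample mean and sample variance over all 3^n samples.
exp_mean = Fraction(0)
exp_s2 = Fraction(0)
for idx in product(range(len(support)), repeat=n):
    weight = Fraction(1)
    for i in idx:
        weight *= probs[i]
    xs = [support[i] for i in idx]
    xbar = Fraction(sum(xs), n)
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    exp_mean += weight * xbar
    exp_s2 += weight * s2
```

Both expectations coincide with the population values, and nothing in the check uses the particular form of the distribution, in line with the distribution-free character noted above.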


5.2.2. Minimum Variance Unbiased Estimators 


In Example 5.2, one can see a case where an unbiased estimator, which is not a 
function of the minimal sufficient statistic (m.s.s.), has a larger variance than the one 
based on the m.s.s. The question is whether this result holds generally. The main 
theorem of this section establishes that if a family of distribution functions admits a 
complete sufficient statistic then the minimum variance unbiased estimator (MVUE) 
is unique, with probability one, and is a function of that statistic. The following is 
the fundamental theorem of the theory of unbiased estimation. It was proven by Rao 
(1945, 1947, 1949), Blackwell (1947), and Lehmann and Scheffé (1950). 


Theorem 5.2.1 (The Rao–Blackwell–Lehmann–Scheffé Theorem). Let $\mathcal{F} =
\{F(\mathbf{x}; \theta); \theta \in \Theta\}$ be a parametric family of distributions of a random vector $\mathbf{X} =
(X_1, \ldots, X_n)$. Suppose that $\omega = g(\theta)$ has an unbiased estimator $\hat{g}(\mathbf{X})$. If $\mathcal{F}$ admits a
(minimal) sufficient statistic $T(\mathbf{X})$ then

$$\hat{\omega} = E\{\hat{g}(\mathbf{X}) \mid T(\mathbf{X})\} \qquad (5.2.2)$$

is an unbiased estimator of $\omega$ and

$$\text{Var}_\theta\{\hat{\omega}\} \leq \text{Var}_\theta\{\hat{g}(\mathbf{X})\}, \qquad (5.2.3)$$

for all $\theta \in \Theta$. Furthermore, if $T(\mathbf{X})$ is a complete sufficient statistic then $\hat{\omega}$ is
essentially the unique minimum variance unbiased (MVU) estimator, for each $\theta$ in $\Theta$.


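Proof. (i) Since $T(\mathbf{X})$ is a sufficient statistic, the conditional expectation
$E\{\hat{g}(\mathbf{X}) \mid T(\mathbf{X})\}$ does not depend on $\theta$ and is therefore a statistic. Moreover, according
to the law of the iterated expectations and since $\hat{g}(\mathbf{X})$ is unbiased, we obtain

$$g(\theta) = E_\theta\{\hat{g}(\mathbf{X})\} = E_\theta\{E\{\hat{g}(\mathbf{X}) \mid T(\mathbf{X})\}\} = E_\theta\{\hat{\omega}\}, \quad \text{for all } \theta \in \Theta. \qquad (5.2.4)$$

Hence, $\hat{\omega}$ is an unbiased estimator of $g(\theta)$. By the law of the total variance,

$$\text{Var}_\theta\{\hat{g}(\mathbf{X})\} = E_\theta\{\text{Var}\{\hat{g}(\mathbf{X}) \mid T(\mathbf{X})\}\} + \text{Var}_\theta\{E\{\hat{g}(\mathbf{X}) \mid T(\mathbf{X})\}\}. \qquad (5.2.5)$$

The second term on the RHS of (5.2.5) is the variance of $\hat{\omega}$. Moreover, $\text{Var}\{\hat{g}(\mathbf{X}) \mid
T(\mathbf{X})\} \geq 0$ with probability one for each $\theta$ in $\Theta$. Hence, the first term on the RHS of
(5.2.5) is nonnegative. This establishes (5.2.3).

(ii) Let $T(\mathbf{X})$ be a complete sufficient statistic and assume that $\hat{\omega} = \phi_1(T(\mathbf{X}))$.
Let $\tilde{\omega}(\mathbf{X})$ be any unbiased estimator of $\omega = g(\theta)$, which depends on $T(\mathbf{X})$, i.e.,
$\tilde{\omega}(\mathbf{X}) = \phi_2(T(\mathbf{X}))$. Then, $E_\theta\{\hat{\omega}\} = E_\theta\{\tilde{\omega}(\mathbf{X})\}$ for all $\theta$. Or, equivalently

$$E_\theta\{\phi_1(T) - \phi_2(T)\} = 0, \quad \text{all } \theta \in \Theta. \qquad (5.2.6)$$

Hence, from the completeness of $T(\mathbf{X})$, $\phi_1(T) = \phi_2(T)$ with probability one for each
$\theta \in \Theta$. This proves that $\hat{\omega} = \phi_1(T)$ is essentially unique and implies also that $\hat{\omega}$ has
the minimal variance at each $\theta$. QED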


Part (i) of the above theorem provides also a method of constructing MVUEs. 
One starts with any unbiased estimator, as simple as possible, and then determines 
its conditional expectation, given $T(\mathbf{X})$. This procedure of deriving MVUEs is called
in the literature “Rao—Blackwellization.” Example 5.3 illustrates this method. 
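As a small illustration of the procedure (an assumed example, not the one in the text): for i.i.d. Bernoulli trials, start from the crude unbiased estimator $X_1$ of $\theta$ and condition on $T = \sum X_i$. Sequences with the same value of $T$ are equally likely, so the conditional expectation can be found by plain counting, and it turns out to be $T/n$.

```python
from fractions import Fraction
from itertools import product

def rao_blackwell_table(n):
    """E{X_1 | T = t}, t = 0..n, for i.i.d. Bernoulli trials, by enumeration.
    Given T = t all 0/1 sequences with t ones are equally likely, so the
    result does not depend on theta."""
    sums, counts = {}, {}
    for xs in product((0, 1), repeat=n):
        t = sum(xs)
        sums[t] = sums.get(t, 0) + xs[0]
        counts[t] = counts.get(t, 0) + 1
    return {t: Fraction(sums[t], counts[t]) for t in counts}

def variances(n, theta):
    """(Var{X_1}, Var{T/n}) under Bernoulli(theta)."""
    var_crude = theta * (1 - theta)
    var_rb = theta * (1 - theta) / n
    return var_crude, var_rb

table = rao_blackwell_table(5)
var_crude, var_rb = variances(5, Fraction(1, 3))
```

The table gives $E\{X_1 \mid T = t\} = t/n$, and the variance drops from $\theta(1-\theta)$ to $\theta(1-\theta)/n$, in agreement with (5.2.3).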

In the following section, we prove and illustrate an information lower bound for 
variances of unbiased estimators. This lower bound plays an important role in the 
theory of statistical inference. 


5.2.3 The Cramér—Rao Lower Bound for the One-Parameter Case 


The following theorem was first proven by Fréchet (1943) and then by Rao (1945) 
and Cramér (1946). Although conditions (i)—(iii), (v) of the following theorem coin- 
cide with conditions (3.7.8) we restate them. Conditions (i)—(iv) will be labeled the 
Cramér—Rao (CR) regularity conditions. 
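Theorem 5.2.2. Let $\mathcal{F}$ be a one-parameter family of distributions of a random vector
$\mathbf{X} = (X_1, \ldots, X_n)$, having probability density functions (p.d.f.s) $f(\mathbf{x}; \theta)$, $\theta \in \Theta$. Let
$\omega(\theta)$ be a differentiable function of $\theta$ and $\hat{\omega}(\mathbf{X})$ an unbiased estimator of $\omega(\theta)$. Assume
that the following regularity conditions hold:

(i) $\Theta$ is an open interval on the real line.
(ii) $\frac{\partial}{\partial\theta}f(\mathbf{x}; \theta)$ exists (finite) for every $\mathbf{x}$ and every $\theta$ in $\Theta$, and $\{\mathbf{x}: f(\mathbf{x}; \theta) > 0\}$
does not depend on $\theta$.

(iii) For each $\theta$ in $\Theta$, there exists a $\delta > 0$ and a positive integrable function $G(\mathbf{x}; \theta)$
such that for all $\phi \in (\theta - \delta, \theta + \delta)$

$$\left|\frac{f(\mathbf{x}; \phi) - f(\mathbf{x}; \theta)}{\phi - \theta}\right| \leq G(\mathbf{x}; \theta).$$

(iv) For each $\theta$ in $\Theta$, there exists a $\delta' > 0$ and a positive integrable function $H(\mathbf{x}; \theta)$
such that, for all $\phi \in (\theta - \delta', \theta + \delta')$

$$\left|\hat{\omega}(\mathbf{x})\,\frac{f(\mathbf{x}; \phi) - f(\mathbf{x}; \theta)}{\phi - \theta}\right| \leq H(\mathbf{x}; \theta).$$

(v) $0 < I_n(\theta) = E_\theta\left\{\left[\frac{\partial}{\partial\theta}\log f(\mathbf{X}; \theta)\right]^2\right\} < \infty$ for each $\theta \in \Theta$.

Then,

$$\text{Var}_\theta\{\hat{\omega}(\mathbf{X})\} \geq \frac{(\omega'(\theta))^2}{I_n(\theta)}, \quad \text{for all } \theta \in \Theta. \qquad (5.2.7)$$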


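Proof. Consider the covariance, for a given $\theta$ value, between $\frac{\partial}{\partial\theta}\log f(\mathbf{X}; \theta)$
and $\hat{\omega}(\mathbf{X})$. We have shown in (3.7.3) that under the above regularity conditions
$E_\theta\left\{\frac{\partial}{\partial\theta}\log f(\mathbf{X}; \theta)\right\} = 0$. Hence,

$$\text{cov}\Big(\frac{\partial}{\partial\theta}\log f(\mathbf{X}; \theta), \hat{\omega}(\mathbf{X})\Big) = E_\theta\Big\{\frac{\partial}{\partial\theta}\log f(\mathbf{X}; \theta) \cdot \hat{\omega}(\mathbf{X})\Big\}$$

$$= \int_{-\infty}^{\infty}\frac{\partial}{\partial\theta}\log f(\mathbf{x}; \theta) \cdot \hat{\omega}(\mathbf{x})f(\mathbf{x}; \theta)\,d\mathbf{x} \qquad (5.2.8)$$

$$= \frac{\partial}{\partial\theta}\int_{-\infty}^{\infty}\hat{\omega}(\mathbf{x})f(\mathbf{x}; \theta)\,d\mathbf{x} = \omega'(\theta).$$

The interchange of differentiation and integration is justified by condition (iv). On
the other hand, by the Schwarz inequality

$$\Big(\text{cov}\Big(\frac{\partial}{\partial\theta}\log f(\mathbf{X}; \theta), \hat{\omega}(\mathbf{X})\Big)\Big)^2 \leq \text{Var}\{\hat{\omega}(\mathbf{X})\} \cdot I_n(\theta), \qquad (5.2.9)$$

since the variance of $\frac{\partial}{\partial\theta}\log f(\mathbf{X}; \theta)$ is equal to the Fisher information function $I_n(\theta)$,
and the square of the coefficient of correlation between $\hat{\omega}(\mathbf{X})$ and $\frac{\partial}{\partial\theta}\log f(\mathbf{X}; \theta)$
cannot exceed 1. From (5.2.8) and (5.2.9), we obtain the Cramér–Rao inequality
(5.2.7). QED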
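The inequality can be checked exactly in a simple case. The sketch below is an assumed example: $n$ Bernoulli($\theta$) trials with $\omega(\theta) = \theta$ and the unbiased estimator $\bar{X}$. It computes $\text{Var}_\theta\{\bar{X}\}$ and $I_n(\theta)$ as exact expectations over the distribution of the number of successes; the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def bernoulli_cr_check(n, theta):
    """Exact Var{X-bar} and Fisher information I_n(theta) for n Bernoulli trials.
    The score of the sample depends on the data only through t = sum x_i:
    d/dtheta log f = t/theta - (n - t)/(1 - theta)."""
    info = Fraction(0)
    mean = Fraction(0)
    second = Fraction(0)
    for t in range(n + 1):
        p_t = comb(n, t) * theta ** t * (1 - theta) ** (n - t)
        score = Fraction(t) / theta - Fraction(n - t) / (1 - theta)
        info += p_t * score ** 2
        mean += p_t * Fraction(t, n)
        second += p_t * Fraction(t, n) ** 2
    variance = second - mean ** 2
    return variance, info

variance, info = bernoulli_cr_check(6, Fraction(1, 4))
```

Here $\omega'(\theta) = 1$, so the bound in (5.2.7) is $1/I_n(\theta)$, and the variance of $\bar{X}$ equals it exactly; this is an instance in which the lower bound is attained.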


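We show that if an unbiased estimator $\hat{\theta}(\mathbf{X})$ has a distribution of the one-parameter
exponential type, then the variance of $\hat{\theta}(\mathbf{X})$ attains the Cramér–Rao lower bound.
Indeed, let

$$f(\hat{\theta}; \theta) = h(\hat{\theta})\exp\{\hat{\theta}\psi(\theta) - K(\theta)\}, \qquad (5.2.10)$$

where $\psi(\theta)$ and $K(\theta)$ are differentiable, and $\psi'(\theta) \neq 0$ for all $\theta$; then

$$E_\theta\{\hat{\theta}\} = +\frac{K'(\theta)}{\psi'(\theta)} \qquad (5.2.11)$$

and

$$V_\theta\{\hat{\theta}\} = \frac{-\psi''(\theta)K'(\theta) + \psi'(\theta)K''(\theta)}{(\psi'(\theta))^3}. \qquad (5.2.12)$$

Since $\hat{\theta}(\mathbf{X})$ is a sufficient statistic, $I_n(\theta)$ is equal to

$$I_n(\theta) = (\psi'(\theta))^2 V_\theta\{\hat{\theta}(\mathbf{X})\}. \qquad (5.2.13)$$

Moreover, $\hat{\theta}(\mathbf{X})$ is an unbiased estimator of $g(\theta) = +K'(\theta)/\psi'(\theta)$. Hence, we readily
obtain that

$$V_\theta\{\hat{\theta}(\mathbf{X})\} = \frac{[\psi''(\theta)K'(\theta) - \psi'(\theta)K''(\theta)]^2}{(\psi'(\theta))^6\, V_\theta\{\hat{\theta}(\mathbf{X})\}} = (g'(\theta))^2/I_n(\theta). \qquad (5.2.14)$$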
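As a worked instance of (5.2.10)-(5.2.14), consider the following assumed example, which is not taken from the text: $X_1, \ldots, X_n$ i.i.d. Poisson with mean $\theta$, and $\hat{\theta} = T = \sum X_i$, whose p.d.f. is of the form (5.2.10) with $\psi(\theta) = \log\theta$ and $K(\theta) = n\theta$.

```latex
\begin{align*}
f(t;\theta) &= \frac{n^t}{t!}\exp\{t\log\theta - n\theta\},
  \qquad \psi(\theta) = \log\theta,\quad K(\theta) = n\theta, \\
E_\theta\{T\} &= \frac{K'(\theta)}{\psi'(\theta)} = \frac{n}{1/\theta} = n\theta, \\
V_\theta\{T\} &= \frac{-\psi''(\theta)K'(\theta) + \psi'(\theta)K''(\theta)}{(\psi'(\theta))^3}
  = \frac{(1/\theta^2)\,n + 0}{1/\theta^3} = n\theta, \\
I_n(\theta) &= (\psi'(\theta))^2 V_\theta\{T\} = \frac{n\theta}{\theta^2} = \frac{n}{\theta}, \\
\frac{(g'(\theta))^2}{I_n(\theta)} &= \frac{n^2}{n/\theta} = n\theta = V_\theta\{T\},
  \qquad g(\theta) = n\theta .
\end{align*}
```

The last line shows that $T$, as an unbiased estimator of $g(\theta) = n\theta$, attains the Cramér–Rao lower bound in this case.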


We ask now the question: if the variance of an unbiased estimator §(X) attains the 
Cramér—Rao lower bound, can we infer that its distribution is of the one-parameter 
exponential type? Joshi (1976) provided a counter example. However, under the right 
regularity conditions the above implication can be made. These conditions were given 
first by Wijsman (1973) and then generalized by Joshi (1976). 

Bhattacharyya (1946) generalized the Cramér–Rao lower bound to (regular) cases
where $\omega(\theta)$ is k-times differentiable at all $\theta$. This generalization shows that, under
further regularity conditions, if $\omega^{(i)}(\theta)$ is the ith derivative of $\omega(\theta)$ and $V$ is a $k \times k$
positive definite matrix, for all $\theta$, with elements

$$
V_{ij} = E_\theta\left\{\frac{1}{f(X;\theta)}\frac{\partial^{i}}{\partial\theta^{i}}f(X;\theta)\cdot
\frac{1}{f(X;\theta)}\frac{\partial^{j}}{\partial\theta^{j}}f(X;\theta)\right\},
$$

then

$$
\operatorname{Var}_\theta\{\hat\omega(X)\} \ge
(\omega^{(1)}(\theta),\ldots,\omega^{(k)}(\theta))\,V^{-1}\,(\omega^{(1)}(\theta),\ldots,\omega^{(k)}(\theta))'.
\tag{5.2.15}
$$

Fend (1959) has proven that if the distribution of X belongs to the one-parameter
exponential family, and if the variance of an unbiased estimator of $\omega(\theta)$, $\hat\omega(X)$,
attains the kth order Bhattacharyya lower bound (BLB) for all $\theta$, but does not attain
the (k − 1)st lower bound, then $\hat\omega(X)$ is a polynomial of degree k in U(X).
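
As a hedged numerical illustration of (5.2.15), which is not taken from the text, consider $T \sim B(n, p)$ and $\omega(p) = p^2$. The estimator $T(T-1)/(n(n-1))$ is unbiased and is a polynomial of degree 2 in $T$, so one expects it to miss the Cramér–Rao bound but to attain the second-order bound. The sketch uses the identities $f'/f = s$ and $f''/f = s^2 + \partial s/\partial p$, where $s$ is the score; all names and the values $n = 10$, $p = 0.4$ are illustrative assumptions.

```python
from math import comb


def second_order_bounds(n, p):
    """Return (exact variance, Cramer-Rao bound, 2nd-order Bhattacharyya bound)
    for estimating omega(p) = p^2 from T ~ Binomial(n, p)."""
    ts = range(n + 1)
    w = [comb(n, t) * p**t * (1 - p) ** (n - t) for t in ts]   # binomial p.m.f.
    s = [t / p - (n - t) / (1 - p) for t in ts]                # f'/f, the score
    ds = [-t / p**2 - (n - t) / (1 - p) ** 2 for t in ts]      # derivative of the score
    r = [s[t] ** 2 + ds[t] for t in ts]                        # f''/f
    v11 = sum(w[t] * s[t] * s[t] for t in ts)
    v12 = sum(w[t] * s[t] * r[t] for t in ts)
    v22 = sum(w[t] * r[t] * r[t] for t in ts)
    d1, d2 = 2.0 * p, 2.0                                      # omega'(p), omega''(p)
    det = v11 * v22 - v12 * v12
    # Quadratic form (d1, d2) V^{-1} (d1, d2)' for the 2 x 2 matrix V.
    blb2 = (d1 * d1 * v22 - 2.0 * d1 * d2 * v12 + d2 * d2 * v11) / det
    est = [t * (t - 1) / (n * (n - 1)) for t in ts]            # unbiased for p^2
    mean = sum(w[t] * est[t] for t in ts)
    var = sum(w[t] * (est[t] - mean) ** 2 for t in ts)
    return var, d1 * d1 / v11, blb2


var, cr, blb2 = second_order_bounds(10, 0.4)
print(var, cr, blb2)
```

In this example the first-order bound is strictly smaller than the variance, while the second-order bound coincides with it up to rounding.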


5.2.4 Extension of the Cramér—Rao Inequality to Multiparameter Cases 


The Cramér–Rao inequality can be generalized to estimation problems in k-parameter
models in the following manner. Suppose that $\mathcal{F}$ is a family of distribution functions
having density functions (or probability functions) $f(x;\theta)$ where $\theta = (\theta_1,\ldots,\theta_k)'$
is a k-dimensional vector. Let $I(\theta)$ denote a $k \times k$ Fisher information matrix, with
elements

$$
I_{ij}(\theta) = E_\theta\left\{\frac{\partial}{\partial\theta_i}\log f(X;\theta)\cdot
\frac{\partial}{\partial\theta_j}\log f(X;\theta)\right\},
$$

$i, j = 1,\ldots,k$. We obviously assume that for each $\theta$ in the parameter space $\Theta$,
$I_{ij}(\theta)$ is finite. It is easy to show that the matrix $I(\theta)$ is nonnegative definite. We will
assume, however, that the Fisher information matrix is positive definite. Furthermore,
let $g_1(\theta),\ldots,g_r(\theta)$ be r parametric functions, $r = 1, 2,\ldots,k$. Define the matrix of
partial derivatives

$$
D(\theta) = \{D_{ij}(\theta);\ i = 1,\ldots,r;\ j = 1,\ldots,k\},
\tag{5.2.16}
$$

where $D_{ij}(\theta) = \frac{\partial}{\partial\theta_j}g_i(\theta)$. Let $\hat g(X)$ be an r-dimensional vector of unbiased estimators
of $g_1(\theta),\ldots,g_r(\theta)$, i.e., $\hat g(X) = (\hat g_1(X),\ldots,\hat g_r(X))$. Let $\Sigma(\hat g)$ denote the
variance–covariance matrix of $\hat g(X)$. The Cramér–Rao inequality can then be generalized,
under regularity conditions similar to those of the theorem, to yield the inequality

$$
\Sigma(\hat g) \ge D(\theta)(I(\theta))^{-1}D'(\theta),
\tag{5.2.17}
$$

in the sense that $\Sigma(\hat g) - D(\theta)(I(\theta))^{-1}D'(\theta)$ is a nonnegative definite matrix. In the
special case of one parameter function $g(\theta)$, if $\hat g(X)$ is an unbiased estimator of $g(\theta)$
then

$$
\operatorname{Var}_\theta\{\hat g(X)\} \ge (\nabla g(\theta))'(I(\theta))^{-1}\nabla g(\theta),
\tag{5.2.18}
$$

where $\nabla g(\theta) = \left(\frac{\partial}{\partial\theta_1}g(\theta),\ldots,\frac{\partial}{\partial\theta_k}g(\theta)\right)'$.
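
A short worked instance of (5.2.18), added here as an illustration and not taken from the text: assume a sample of n i.i.d. $N(\mu, \sigma^2)$ observations with $\theta = (\mu, \sigma^2)'$, let $I(\theta)$ be the information matrix of the whole sample, and let $g(\theta) = \mu + z_p\sigma$ be the p-th quantile, with $z_p$ the standard normal quantile.

```latex
\[
I(\theta) = n\begin{pmatrix} 1/\sigma^{2} & 0 \\ 0 & 1/(2\sigma^{4}) \end{pmatrix},
\qquad
\nabla g(\theta) = \left(1,\ \frac{z_p}{2\sigma}\right)',
\]
\[
(\nabla g(\theta))'(I(\theta))^{-1}\nabla g(\theta)
= \frac{\sigma^{2}}{n} + \frac{z_p^{2}}{4\sigma^{2}}\cdot\frac{2\sigma^{4}}{n}
= \frac{\sigma^{2}}{n}\left(1 + \frac{z_p^{2}}{2}\right).
\]
```

Under these assumptions, no unbiased estimator of the quantile can have variance below $(\sigma^2/n)(1 + z_p^2/2)$.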


5.2.5 General Inequalities of the Cramér—Rao Type 


The Cramér—Rao inequality is based on four stringent assumptions concerning the 
family of distributions under consideration. These assumptions may not be fulfilled 
in cases of practical interest. In order to overcome this difficulty, several studies 
were performed and various different general inequalities were suggested. Blyth and 
Roberts (1972) provided a general theoretical framework for these generalizations. 
We present here the essential results. 

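Let $X_1,\ldots,X_n$ be independent and identically distributed (i.i.d.) random variables
having a common distribution F that belongs to a one-parameter family $\mathcal{F}$, having
p.d.f. $f(x;\theta)$, $\theta \in \Theta$. Suppose that $g(\theta)$ is a parametric function considered for
estimation. Let $T(X)$ be a sufficient statistic for $\mathcal{F}$ and let $\hat g(T)$ be an unbiased estimator
of $g(\theta)$. Let $W(T;\theta)$ be a real-valued random variable such that $\operatorname{Var}_\theta\{W(T;\theta)\} > 0$
and finite for every $\theta$. We also assume that $0 < \operatorname{Var}_\theta\{\hat g(T)\} < \infty$ for each $\theta$ in $\Theta$.
Then, from the Schwarz inequality, we obtain

$$
\operatorname{Var}_\theta\{\hat g(T)\} \ge
\frac{(\operatorname{cov}_\theta(\hat g(T), W(T;\theta)))^{2}}{\operatorname{Var}_\theta\{W(T;\theta)\}}
\tag{5.2.19}
$$

for every $\theta \in \Theta$. We recall that for the Cramér–Rao inequality, we have used

$$
W(T;\theta) = \frac{\partial}{\partial\theta}\log f(X;\theta) = \frac{\partial}{\partial\theta}\log h(T;\theta),
\tag{5.2.20}
$$

where $h(t;\theta)$ is the p.d.f. of T at $\theta$.
Chapman and Robbins (1951) and Kiefer (1952) considered a family of random
variables $W_\phi(T;\theta)$, where $\phi$ ranges over $\Theta$ and is given by the likelihood ratio

$$
W_\phi(T;\theta) = \frac{h(T;\phi)}{h(T;\theta)}.
$$

The inequality (5.2.19) then becomes

$$
\operatorname{Var}_\theta\{\hat g(T)\} \ge \frac{(g(\phi) - g(\theta))^{2}}{\operatorname{Var}_\theta\{W_\phi(T;\theta)\}}.
\tag{5.2.21}
$$

One obtains then that (5.2.21) holds for each $\phi$ in $\Theta$. Hence, considering the supremum
of the RHS of (5.2.21) over all values of $\phi$, we obtain

$$
\operatorname{Var}_\theta\{\hat g(T)\} \ge \sup_{\phi\in\Theta}\frac{(g(\phi) - g(\theta))^{2}}{A(\theta,\phi)},
\tag{5.2.22}
$$

where $A(\theta,\phi) = \operatorname{Var}_\theta\{W_\phi(T;\theta)\}$. Indeed,

$$
\begin{aligned}
\operatorname{cov}_\theta(\hat g(T), W_\phi(T;\theta))
&= E_\phi\{\hat g(T)\} - E_\theta\{\hat g(T)\}\cdot E_\theta\{W_\phi(T;\theta)\} \\
&= g(\phi) - g(\theta).
\end{aligned}
\tag{5.2.23}
$$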
This inequality requires that all the p.d.f.s of T, i.e., $h(t;\theta)$, $\theta \in \Theta$, be positive
on the same set, which is independent of any unknown parameter. Such a condition
restricts the application of the Chapman–Robbins inequality. We cannot consider it,
for example, in the case of a life-testing model in which the family $\mathcal{F}$ is that of
location-parameter exponential distributions, i.e., $f(x;\theta) = I\{x \ge \theta\}\exp\{-(x - \theta)\}$, with
$0 < \theta < \infty$. However, one can consider the variable $W_\phi(T;\theta)$ for all $\phi$ values such
that $h(t;\phi) = 0$ on the set $N_\theta = \{t : h(t;\theta) = 0\}$. In the above location-parameter
example, we can restrict attention to the set of $\phi$ values that are greater than $\theta$. If we
denote this set by $C(\theta)$ then we have the Chapman–Robbins inequality as follows:

$$
\operatorname{Var}_\theta\{\hat g(T)\} \ge \sup_{\phi\in C(\theta)}\frac{(g(\phi) - g(\theta))^{2}}{A(\theta,\phi)}.
\tag{5.2.24}
$$


The Chapman-Robbins inequality is applicable, as we have seen in the previous 
example, in cases where the Cramér—Rao inequality is inapplicable. On the other hand, 
we can apply the Chapman—Robbins inequality also in cases satisfying the Cramér— 
Rao regularity conditions. The question is then, what is the relationship between 
the Chapman—Robbins lower bound and Cramér—Rao lower bound. Chapman and 
Robbins (1951) have shown that their lower bound is greater than or equal to the 
Cramér—Rao lower bound for all 6. 
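
For the location-parameter exponential example, the bound (5.2.24) can be evaluated numerically. The following sketch is an added illustration under stated assumptions: $g(\theta) = \theta$, $T = X_{(1)}$ is the sufficient statistic with p.d.f. $h(t;\theta) = n\exp\{-n(t-\theta)\}$ for $t \ge \theta$, and a direct computation then gives $A(\theta,\phi) = \exp\{n(\phi-\theta)\} - 1$ for $\phi > \theta$. The supremum is approximated on a grid; the function name, grid size, and upper limit are arbitrary choices.

```python
from math import expm1


def chapman_robbins_bound(n, grid_size=200_000, u_max=20.0):
    """Grid approximation of sup over delta > 0 of delta^2 / (exp(n*delta) - 1),
    where delta = phi - theta and u = n * delta."""
    best = 0.0
    for i in range(1, grid_size + 1):
        u = u_max * i / grid_size
        value = (u / n) ** 2 / expm1(u)    # expm1(u) = exp(u) - 1, accurate for small u
        if value > best:
            best = value
    return best


n = 10                                     # illustrative sample size
bound = chapman_robbins_bound(n)
var_unbiased = 1.0 / n**2                  # variance of the unbiased estimator X_(1) - 1/n
print(bound, var_unbiased)
```

Under these assumptions the bound is a fixed fraction of $1/n^2$, so it is of the same order as the variance of $X_{(1)} - 1/n$ but is not attained by it.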


5.3. THE EFFICIENCY OF UNBIASED ESTIMATORS 
IN REGULAR CASES 


Let $\hat g_1(X)$ and $\hat g_2(X)$ be two unbiased estimators of $g(\theta)$. Assume that the
density functions and the estimators satisfy the Cramér–Rao regularity conditions. The
relative efficiency of $\hat g_1(X)$ to $\hat g_2(X)$ is defined as the ratio of their variances,

$$
\mathcal{E}_\theta(\hat g_1, \hat g_2) = \frac{\sigma^{2}_{\hat g_2}(\theta)}{\sigma^{2}_{\hat g_1}(\theta)},
\tag{5.3.1}
$$

where $\sigma^{2}_{\hat g_i}(\theta)$ $(i = 1, 2)$ is the variance of $\hat g_i(X)$ at $\theta$. In order to compare all the
unbiased estimators of $g(\theta)$ on the same basis, we replace $\sigma^{2}_{\hat g_2}(\theta)$ by the Cramér–Rao
lower bound (5.2.7). In this manner, we obtain the efficiency function

$$
\mathcal{E}_\theta(\hat g) = \frac{(g'(\theta))^{2}}{I_n(\theta)\,\sigma^{2}_{\hat g}(\theta)},
\tag{5.3.2}
$$

for all $\theta \in \Theta$. This function assumes values between zero and one. It is equal to one,
for all $\theta$, if and only if $\sigma^{2}_{\hat g}(\theta)$ attains the Cramér–Rao lower bound, or equivalently, if
the distribution of $\hat g(X)$ is of the exponential type.
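
A small sketch of the efficiency function (5.3.2), added as an illustration and not taken from the text. It assumes $X_1,\ldots,X_n$ i.i.d. Poisson with mean $\lambda$, $g(\lambda) = e^{-\lambda}$, and the unbiased estimator $(1 - 1/n)^T$ with $T = \sum X_i$. Under these assumptions the variance is $e^{-2\lambda}(e^{\lambda/n} - 1)$, $I_n(\lambda) = n/\lambda$, and $(g'(\lambda))^2 = e^{-2\lambda}$. The function name is a hypothetical choice.

```python
from math import exp, expm1


def efficiency(lam, n):
    """Efficiency (5.3.2) of (1 - 1/n)^T as an estimator of exp(-lam),
    for T ~ Poisson(n * lam)."""
    g_prime_sq = exp(-2.0 * lam)                 # (g'(lam))^2
    fisher_n = n / lam                           # I_n(lam)
    variance = exp(-2.0 * lam) * expm1(lam / n)  # exact variance of the estimator
    return g_prime_sq / (fisher_n * variance)


for n in (5, 20, 100):
    print(n, efficiency(2.0, n))
```

The ratio reduces to $(\lambda/n)/(e^{\lambda/n} - 1)$, which is below one for every finite n and approaches one as n grows.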

Consider the covariance between $\hat g(X)$ and the score function
$S(X;\theta) = \frac{\partial}{\partial\theta}\log f(X;\theta)$. As we have shown in the proof of the Cramér–Rao inequality,

$$
(g'(\theta))^{2} = \rho^{2}_\theta(\hat g, S)\,I_n(\theta)\,\sigma^{2}_{\hat g}(\theta),
\tag{5.3.3}
$$

where $\rho_\theta(\hat g, S)$ is the coefficient of correlation between the estimator $\hat g$ and the score
function, $S(X;\theta)$, at $\theta$. Hence, the efficiency function is

$$
\mathcal{E}_\theta(\hat g) = \rho^{2}_\theta(\hat g, S).
\tag{5.3.4}
$$

Moreover, the relative efficiency of two unbiased estimators $\hat g_1$ and $\hat g_2$ is given by

$$
\mathcal{E}_\theta(\hat g_1, \hat g_2) = \rho^{2}_\theta(\hat g_1, S)/\rho^{2}_\theta(\hat g_2, S).
\tag{5.3.5}
$$

This relative efficiency can be expressed also in terms of the ratio of the Fisher
information functions obtained from the corresponding distributions of the estimators.
That is, if $h(\hat g_i;\theta)$, $i = 1, 2$, is the p.d.f. of $\hat g_i$ and
$I^{\hat g_i}(\theta) = E_\theta\left\{\left[\frac{\partial}{\partial\theta}\log h(\hat g_i;\theta)\right]^{2}\right\}$ then

$$
\mathcal{E}_\theta(\hat g_1, \hat g_2) = \frac{I^{\hat g_1}(\theta)}{I^{\hat g_2}(\theta)}, \quad \theta \in \Theta.
\tag{5.3.6}
$$

It is a straightforward matter to show that for every unbiased estimator $\hat g$ of $g(\theta)$ and
under the Cramér–Rao regularity conditions

$$
I^{\hat g}(\theta) = (g'(\theta))^{2}/\sigma^{2}_{\hat g}(\theta), \quad \text{for all } \theta \in \Theta.
\tag{5.3.7}
$$

Thus, the relative efficiency function (5.3.6) can be written, for cases satisfying the
Cramér–Rao regularity conditions, in the form

$$
\mathcal{E}_\theta(\hat g_1, \hat g_2) =
\frac{(g_1'(\theta))^{2}\,\sigma^{2}_{\hat g_2}(\theta)}{(g_2'(\theta))^{2}\,\sigma^{2}_{\hat g_1}(\theta)},
\tag{5.3.8}
$$

where $\hat g_1(X)$ and $\hat g_2(X)$ are unbiased estimators of $g_1(\theta)$ and $g_2(\theta)$, respectively. If
the two estimators are unbiased estimators of the same function $g(\theta)$ then (5.3.8)
is reduced to (5.3.1). The relative efficiency function (5.3.8) is known as the Pitman
relative efficiency. It relates both the variances and the derivatives of the bias
functions of the two estimators (see Pitman, 1948).

The information function of an estimator can be generalized to the multiparameter
regular case (see Bhapkar, 1972). Let $\theta = (\theta_1,\ldots,\theta_k)$ be a vector of k parameters
and $I(\theta)$ be the Fisher information matrix (corresponding to one observation). If
$g_1(\theta),\ldots,g_r(\theta)$, $1 \le r \le k$, are functions satisfying the required differentiability
conditions and $\hat g_1(X),\ldots,\hat g_r(X)$ are the corresponding unbiased estimators then,
from (5.2.18),

$$
|\Sigma_\theta(\hat g)| \ge \left|\frac{1}{n}\,D(\theta)I^{-1}(\theta)D'(\theta)\right|,
\tag{5.3.9}
$$


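where n is the sample size. Note that if r = k then $D(\theta)$ is nonsingular (the parametric
functions $g_1(\theta),\ldots,g_k(\theta)$ are linearly independent), and we can express the above
inequality in the form

$$
|I_n(\theta)| \ge \frac{|D(\theta)|^{2}}{|\Sigma_\theta(\hat g)|}, \quad \text{for all } \theta \in \Theta,
\tag{5.3.10}
$$

where $I_n(\theta) = nI(\theta)$.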


Accordingly, and in analogy to (5.3.7), we define the amount of information in the
vector estimator $\hat g$ as

$$
I_{\hat g}(\theta) = \frac{|D(\theta)|^{2}}{|\Sigma_\theta(\hat g)|}.
\tag{5.3.11}
$$

If $1 \le r < k$ but $D(\theta)$ is of full rank r, then

$$
I_{\hat g}(\theta) = \frac{|D(\theta)D'(\theta)|}{|\Sigma_\theta(\hat g)|}.
\tag{5.3.12}
$$

The efficiency function of a multiparameter estimator is thus defined by DeGroot and
Raghavachari (1970) as

$$
\mathcal{E}_\theta(\hat g_n) = \frac{I_{\hat g}(\theta)}{|I_n(\theta)|}.
\tag{5.3.13}
$$


In Example 5.9, we illustrate the computation needed to determine this efficiency
function.


5.4. BEST LINEAR UNBIASED AND LEAST-SQUARES ESTIMATORS


Best linear unbiased estimators (BLUEs) are linear combinations of the observations
that yield unbiased estimates of the unknown parameters with minimal variance. As
we have seen in Section 5.3, the uniformly minimum variance unbiased (UMVU)
estimators (if they exist) are in many cases nonlinear functions of the observations.
Accordingly, if we confine attention to linear estimators, the variance of the BLUE
will not be smaller than that of the UMVU. On the other hand, BLUEs may exist
when UMVU estimators do not exist. For example, if $X_1,\ldots,X_n$ are i.i.d. random
variables having a Weibull distribution $G^{1/\beta}(\lambda, 1)$ and both $\lambda$ and $\beta$ are unknown,
$0 < \lambda, \beta < \infty$, the m.s.s. is the order statistic $(X_{(1)},\ldots,X_{(n)})$. Suppose that we wish
to estimate the parametric functions $\mu = \frac{1}{\beta}\log\lambda$ and $\sigma = \frac{1}{\beta}$. There are no UMVU
estimators of $\mu$ and $\sigma$. However, there are BLUEs of these parameters.


5.4.1 BLUEs of the Mean 


We start with the case where the n random variables have the same unknown mean, $\mu$,
and the covariance matrix is known. Thus, let $X = (X_1,\ldots,X_n)'$ be a random vector;
$E\{X\} = \mu 1$, $1' = (1, 1,\ldots,1)$; $\mu$ is unknown (real). The covariance of X is $\Sigma$. We
assume that $\Sigma$ is finite and nonsingular. A linear estimator of $\mu$ is a linear function
$\hat\mu = \lambda'X$, where $\lambda$ is a vector of known constants. The expected value of $\hat\mu$ is $\mu$ if,
and only if, $\lambda'1 = 1$. We thus consider the class of all such unbiased estimators and
look for the one with the smallest variance. Such an estimator is called best linear
unbiased (BLUE). The variance of $\hat\mu$ is $V\{\lambda'X\} = \lambda'\Sigma\lambda$. We, therefore, determine
$\lambda^{0}$ that minimizes this variance and satisfies the condition of unbiasedness. Thus, we
have to minimize the Lagrangian

$$
L(\lambda, \tau) = \lambda'\Sigma\lambda + \tau(1 - \lambda'1).
\tag{5.4.1}
$$

It is simple to show that the minimizing vector is unique and is given by

$$
\lambda^{0} = \frac{\Sigma^{-1}1}{1'\Sigma^{-1}1}.
\tag{5.4.2}
$$

Correspondingly, the BLUE is

$$
\hat\mu = 1'\Sigma^{-1}X/1'\Sigma^{-1}1.
\tag{5.4.3}
$$

Note that this BLUE can be obtained also by minimizing the quadratic form

$$
Q(\mu) = (X - \mu 1)'\Sigma^{-1}(X - \mu 1).
\tag{5.4.4}
$$


In Example 5.12, we illustrate a BLUE of the form (5.4.3). 
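
The weights (5.4.2) and the estimator (5.4.3) can be computed with a few lines of code. The sketch below is an added illustration: the covariance matrix `sigma` and the observations `x` are made-up numbers, and `solve` is a small helper written here (plain Gaussian elimination) so that the example needs no external library.

```python
def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):                            # back substitution
        tail = sum(a[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (a[i][n] - tail) / a[i][i]
    return z


def blue_weights(sigma):
    """Weights Sigma^{-1} 1 / (1' Sigma^{-1} 1), as in (5.4.2)."""
    z = solve(sigma, [1.0] * len(sigma))                      # z = Sigma^{-1} 1
    total = sum(z)                                            # 1' Sigma^{-1} 1
    return [zi / total for zi in z]


def quad_form(w, sigma):
    """Variance w' Sigma w of the linear estimator w'X."""
    m = len(w)
    return sum(w[i] * sigma[i][j] * w[j] for i in range(m) for j in range(m))


# Hypothetical known covariance matrix and observations.
sigma = [[1.0, 0.5, 0.0],
         [0.5, 2.0, 0.3],
         [0.0, 0.3, 4.0]]
x = [10.2, 9.1, 11.5]

lam = blue_weights(sigma)
mu_hat = sum(l * xi for l, xi in zip(lam, x))                 # BLUE (5.4.3)
var_blue = quad_form(lam, sigma)
var_mean = quad_form([1.0 / 3.0] * 3, sigma)                  # variance of the plain average
print(lam, mu_hat, var_blue, var_mean)
```

The weights sum to one, and the variance of the BLUE, which equals $1/(1'\Sigma^{-1}1)$, does not exceed that of the unweighted sample mean.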




5.4.2 Least-Squares and BLUEs in Linear Models 


Consider the problem of estimating a vector of parameters in cases where the means 
of the observations are linear combinations of the unknown parameters. Such models 
are called linear models. The literature on estimating parameters in linear mod- 
els is so vast that it would be impractical to try listing here all the major studies. 
We mention, however, the books of Rao (1973), Graybill (1961, 1976), Anderson 
(1958), Searle (1971), Seber (1977), Draper and Smith (1966), and Sen and Srivastava 
(1990). We provide here a short exposition of the least-squares theory for cases of 
full linear rank. 
Linear models of full rank. Suppose that the random vector X has expectation

$$
E\{X\} = A\beta,
\tag{5.4.5}
$$

where X is an $n \times 1$ vector, A is an $n \times p$ matrix of known constants, and $\beta$ a $p \times 1$
vector of unknown parameters. We furthermore assume that $1 \le p \le n$ and A is a
matrix of full rank, p. The covariance matrix of X is $\Sigma = \sigma^{2}I$, where $\sigma^{2}$ is unknown,
$0 < \sigma^{2} < \infty$. An estimator of $\beta$ that minimizes the quadratic form

$$
Q(\beta) = (X - A\beta)'(X - A\beta)
\tag{5.4.6}
$$

is called the least-squares estimator (LSE). This estimator was discussed in Example 2.13
and in Section 4.6 in connection with testing in normal regression models.
The notation here is different from that of Section 4.6 in order to keep it in agreement
with the previous notation of the present section. As given by (4.6.5), the LSE
of $\beta$ is

$$
\hat\beta = (A'A)^{-1}A'X.
\tag{5.4.7}
$$


Note that $\hat\beta$ is an unbiased estimator of $\beta$. To verify it, substitute $A\beta$ in (5.4.7)
instead of X. Furthermore, if BX is an arbitrary unbiased estimator of $\beta$ (B a
$p \times n$ matrix of specified constants) then B should satisfy the condition $BA = I$.
Moreover, the covariance matrix of BX can be expressed in the following manner.
Write $B = B - S^{-1}A' + S^{-1}A'$, where $S = A'A$. Accordingly, the covariance matrix
of BX is

$$
\Sigma(BX) = \Sigma(CX) + \Sigma(\hat\beta) + 2\Sigma(CX, \hat\beta),
\tag{5.4.8}
$$

where $C = B - S^{-1}A'$, $\hat\beta$ is the LSE and $\Sigma(CX, \hat\beta)$ is the covariance matrix of CX
and $\hat\beta$. This covariance matrix is

$$
\Sigma(CX, \hat\beta) = \sigma^{2}(B - S^{-1}A')AS^{-1} = \sigma^{2}(BAS^{-1} - S^{-1}) = 0,
\tag{5.4.9}
$$

since $BA = I$. Thus, the covariance matrix of an arbitrary unbiased estimator of $\beta$
can be expressed as the sum of two covariance matrices, one of the LSE, $\hat\beta$, and one
of CX. $\Sigma(CX)$ is a nonnegative definite matrix. Obviously, when $B = S^{-1}A'$ the
covariance matrix of CX is 0. Otherwise, all the components of $\hat\beta$ have variances
which are smaller than or equal to those of the corresponding components of BX.
Moreover, any linear combination of the components of $\hat\beta$ has a variance not exceeding
that of the same linear combination of BX. It means that the LSE, $\hat\beta$, is also BLUE.
We have thus proven the following celebrated theorem.


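Gauss–Markov Theorem. If $X = A\beta + \epsilon$, where A is a matrix of full rank, $E\{\epsilon\} = 0$
and $\Sigma(\epsilon) = \sigma^{2}I$, then the BLUE of any linear combination $\lambda'\beta$ is $\lambda'\hat\beta$, where $\lambda$ is
a vector of constants and $\hat\beta$ is the LSE of $\beta$. Moreover,

$$
\operatorname{Var}\{\lambda'\hat\beta\} = \sigma^{2}\lambda'S^{-1}\lambda,
\tag{5.4.10}
$$

where $S = A'A$.


Note that an unbiased estimator of $\sigma^{2}$ is

$$
\hat\sigma^{2} = \frac{1}{n-p}X'(I - AS^{-1}A')X.
\tag{5.4.11}
$$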


If the covariance of X is $\sigma^{2}V$, where V is a known symmetric positive definite
matrix then, after making the factorization $V = DD'$ and the transformation
$Y = D^{-1}X$, the problem is reduced to the one with covariance matrix proportional to I.
Substituting $D^{-1}X$ for X and $D^{-1}A$ for A in (5.4.7), we obtain the general formula

$$
\hat\beta = (A'V^{-1}A)^{-1}A'V^{-1}X.
\tag{5.4.12}
$$

The estimator (5.4.12) is the BLUE of $\beta$ and can be considered as the multidimensional
generalization of (5.4.3).

As is illustrated in Example 5.10, when V is an arbitrary positive definite matrix,
the BLUE (5.4.12) is not necessarily equivalent to the LSE (5.4.7). The conditions
under which the two estimators are equivalent were studied by Watson (1967) and
Zyskind (1967). The main result is that the BLUE and the LSE coincide when the
rank of A is p, $1 \le p \le n$, if and only if there exist p eigenvectors of V which form
a basis in the linear space spanned by the columns of A. Haberman (1974) proved
the following interesting inequality. Let $\theta = \sum_{i=1}^{p} c_i\beta_i$, where $(c_1,\ldots,c_p)$ are given
constants. Let $\hat\theta$ and $\theta^{*}$ be, correspondingly, the BLUE and LSE of $\theta$. If $\tau$ is the ratio
of the largest to the smallest eigenvalues of V then

$$
1 \ge \frac{\operatorname{Var}\{\hat\theta\}}{\operatorname{Var}\{\theta^{*}\}} \ge \frac{4\tau}{(1+\tau)^{2}}.
\tag{5.4.13}
$$


5.4.3 Best Linear Combinations of Order Statistics


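Best linear combinations of order statistics are particularly attractive estimates when
the family of distributions under consideration depends on location and scale parameters
and the sample is relatively small. More specifically, suppose that $\mathcal{F}$ is a location-
and scale-parameter family, with p.d.f.s

$$
f(x;\mu,\sigma) = \frac{1}{\sigma}\,\phi\left(\frac{x-\mu}{\sigma}\right),
$$

where $-\infty < \mu < \infty$ and $0 < \sigma < \infty$. Let $U = (X - \mu)/\sigma$ be the standardized
random variable corresponding to X. Suppose that $X_1,\ldots,X_n$ are i.i.d. and let
$X^{*} = (X_{(1)},\ldots,X_{(n)})'$ be the corresponding order statistic. Note that

$$
X_{(i)} \sim \mu + \sigma U_{(i)}, \quad i = 1,\ldots,n,
$$

where $U_1,\ldots,U_n$ are i.i.d. standard variables and $(U_{(1)},\ldots,U_{(n)})$ the corresponding
order statistic. The p.d.f. of U is $\phi(u)$. If the covariance matrix, V, of the order statistic
$(U_{(1)},\ldots,U_{(n)})$ exists, and if $\alpha = (\alpha_1,\ldots,\alpha_n)'$ denotes the vector of expectations of
this order statistic, i.e., $\alpha_i = E\{U_{(i)}\}$, $i = 1,\ldots,n$, then we have the linear model

$$
X^{*} = [1, \alpha]\begin{pmatrix}\mu \\ \sigma\end{pmatrix} + \epsilon^{*},
\tag{5.4.14}
$$

where $E\{\epsilon^{*}\} = 0$ and $\Sigma(\epsilon^{*}) = \sigma^{2}V$. The matrix V is known. Hence, according
to (5.4.12), the BLUE of $(\mu, \sigma)$ is

$$
\begin{pmatrix}\hat\mu \\ \hat\sigma\end{pmatrix} =
\begin{pmatrix} 1'V^{-1}1 & 1'V^{-1}\alpha \\ 1'V^{-1}\alpha & \alpha'V^{-1}\alpha \end{pmatrix}^{-1}
\begin{pmatrix} 1'V^{-1}X^{*} \\ \alpha'V^{-1}X^{*} \end{pmatrix}.
\tag{5.4.15}
$$

Let

$$
\lambda = (1'V^{-1}1)(\alpha'V^{-1}\alpha) - (1'V^{-1}\alpha)^{2}
$$

and

$$
C = V^{-1}(1\alpha' - \alpha 1')V^{-1};
$$

then the BLUE can be written as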


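$$
\hat\mu = -\frac{1}{\lambda}\,\alpha'CX^{*}, \qquad \hat\sigma = \frac{1}{\lambda}\,1'CX^{*}.
\tag{5.4.16}
$$

The variances and covariances of these BLUEs are

$$
V\{\hat\mu\} = \frac{\sigma^{2}}{\lambda}(\alpha'V^{-1}\alpha), \qquad
V\{\hat\sigma\} = \frac{\sigma^{2}}{\lambda}(1'V^{-1}1),
\tag{5.4.17}
$$

and

$$
\operatorname{cov}(\hat\mu, \hat\sigma) = -\frac{\sigma^{2}}{\lambda}(1'V^{-1}\alpha).
$$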


As will be illustrated in the following example the proposed BLUE, based on all the 
n order statistics, becomes impractical in certain situations. 

Example 5.11 illustrates an estimation problem for which the BLUE based on 
all the n order statistics can be determined only numerically, provided the sample is 
not too large. Various methods have been developed to approximate the BLUEs by 
linear combinations of a small number of selected order statistics. Asymptotic (large 
sample) theory has been applied in the theory leading to the optimal choice of selected 
set of k, k <n, order statistics. This choice of order statistics is also called spacing. 
For the theories and methods used for the determination of the optimal spacing see 
the book of Sarhan and Greenberg (1962). 


5.5 STABILIZING THE LSE: RIDGE REGRESSIONS 


The method of ridge regression was introduced by Hoerl (1962) and by Hoerl 
and Kennard (1970). A considerable number of papers have been written on the 
subject since then. In particular see the papers of Marquardt (1970), Stone and 
Conniffe (1973), and others. The main objective of the ridge regression method is 
to overcome a phenomenon of possible instability of least-squares estimates, when 
the matrix of coefficients $S = A'A$ has a large spread of the eigenvalues. To be more
specific, consider again the linear model of full rank: $X = A\beta + \epsilon$, where $E\{\epsilon\} = 0$
and $\Sigma(\epsilon) = \sigma^{2}I$. We have seen that the LSE of $\beta$, $\hat\beta = S^{-1}A'X$, minimizes the
squared distance between the observed random vector X and the estimate of its
expectation $A\beta$, i.e., $\|X - A\beta\|^{2}$. $\|a\|$ denotes the Euclidean length of the vector a,
i.e., $\|a\| = \left(\sum_{i=1}^{n} a_i^{2}\right)^{1/2}$. As we have shown in Section 5.4.2, the LSE in the present
model is BLUE of $\beta$. However, if A is ill-conditioned, in the sense that the positive
definite matrix $S = A'A$ has large spread of the eigenvalues, with some being close
to zero, then the LSE $\hat\beta$ may be with high probability very far from $\beta$. Indeed, if
$L^{2} = \|\hat\beta - \beta\|^{2}$ then


$$
E\{L^{2}\} = \sigma^{2}\operatorname{tr}\{S^{-1}\}.
\tag{5.5.1}
$$

Let P be an orthogonal matrix that diagonalizes S, i.e., $PSP' = \Lambda$, where $\Lambda$ is
a diagonal matrix consisting of the eigenvalues $(\lambda_1,\ldots,\lambda_p)$ of S (all positive).
Accordingly

$$
E\{L^{2}\} = \sigma^{2}\sum_{i=1}^{p}\frac{1}{\lambda_i}.
\tag{5.5.2}
$$

We see that $E\{L^{2}\} \ge \sigma^{2}\frac{1}{\lambda_{\min}}$, where $\lambda_{\min}$ is the smallest eigenvalue. A very large
value of $E\{L^{2}\}$ means that at least one of the components of $\hat\beta$ has a large variance.
This implies that the corresponding value of $\hat\beta_i$ may with high probability be far from
the true value. The matrix A in experimental situations often represents the levels
of certain factors and is generally under control of the experimenter. A good design
will set the levels of the factors so that the columns of A will be orthogonal. In this
case $S = I$, $\lambda_1 = \ldots = \lambda_p = 1$ and $E\{L^{2}\}$ attains the minimum possible value $p\sigma^{2}$
for the LSE. In many practical cases, however, X is observed with an ill-conditioned
coefficient matrix A. In this case, all the unbiased estimators of $\beta$ are expected to
have large values of $L^{2}$. The way to overcome this deficiency is to consider biased
estimators of $\beta$ which are not affected strongly by small eigenvalues. Hoerl (1962)
suggested the class of biased estimators

$$
\beta^{*}(k) = [A'A + kI]^{-1}A'X
\tag{5.5.3}
$$


with $k \ge 0$, called the ridge regression estimators. It can be shown that for every $k > 0$,
$\beta^{*}(k)$ has smaller length than the LSE $\hat\beta$, i.e., $\|\beta^{*}(k)\| < \|\hat\beta\|$. The ridge estimator is
compared to the LSE. If we graph the values of $\beta_i^{*}(k)$ as functions of k we often see
that the estimates are very sensitive to changes in the values of k close to zero, while
eventually as k grows the estimates stabilize. The graphs of $\beta_i^{*}(k)$ for $i = 1,\ldots,p$
are called the ridge trace. It is recommended by Hoerl and Kennard (1970) to choose
the value of k at which the estimates start to stabilize.
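
A ridge trace can be tabulated directly from (5.5.3). The sketch below is an added illustration restricted to a two-column matrix A, so that the $2 \times 2$ system can be solved in closed form; the nearly collinear design `A`, the observations `X`, and the grid of k values are made-up numbers.

```python
def ridge(a_rows, x, k):
    """Ridge estimate [A'A + kI]^{-1} A'X of (5.5.3) for a two-column matrix A."""
    s11 = sum(r[0] * r[0] for r in a_rows) + k
    s12 = sum(r[0] * r[1] for r in a_rows)
    s22 = sum(r[1] * r[1] for r in a_rows) + k
    b1 = sum(r[0] * xi for r, xi in zip(a_rows, x))     # first component of A'X
    b2 = sum(r[1] * xi for r, xi in zip(a_rows, x))     # second component of A'X
    det = s11 * s22 - s12 * s12
    return [(s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det]


def norm(v):
    """Euclidean length of a vector."""
    return sum(c * c for c in v) ** 0.5


# Hypothetical ill-conditioned design: the second column is almost constant.
A = [[1.0, 1.0], [1.0, 1.1], [1.0, 0.9], [1.0, 1.05], [1.0, 0.95]]
X = [2.1, 2.0, 2.2, 1.9, 2.3]

trace = [(k, ridge(A, X, k)) for k in (0.0, 0.01, 0.1, 1.0)]
for k, b in trace:
    print(k, b, norm(b))          # k = 0 gives the LSE
```

In this example the length of the estimate decreases as k increases, and the components change most rapidly for k near zero.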

Among all (biased) estimators B of $\beta$ that lie at a fixed distance from the origin the
ridge estimator $\beta^{*}(k)$, for a proper choice of k, minimizes the residual sum of squares
$\|X - AB\|^{2}$. For proofs of these geometrical properties, see Hoerl and Kennard
(1970). The sum of mean-squared errors (MSEs) of the components of $\beta^{*}(k)$ is

$$
E\{L^{2}(k)\} = E\{\|\beta^{*}(k) - \beta\|^{2}\}
= \sigma^{2}\sum_{i=1}^{p}\frac{\lambda_i}{(\lambda_i + k)^{2}} + k^{2}\sum_{i=1}^{p}\frac{\gamma_i^{2}}{(\lambda_i + k)^{2}},
\tag{5.5.4}
$$

where $\gamma = H\beta$ and H is the orthogonal matrix diagonalizing $A'A$. $E\{L^{2}(k)\}$
is a differentiable function of k, having a unique minimum $k^{0}(\gamma)$. Moreover,
$E\{L^{2}(k^{0}(\gamma))\} < E\{L^{2}(0)\}$, where $E\{L^{2}(0)\}$ is the sum of variances of the LSE
components, as in (5.5.2). The problem is that the value of $k^{0}(\gamma)$ depends on $\gamma$ and if k is
chosen too far from $k^{0}(\gamma)$, $E\{L^{2}(k)\}$ may be greater than $E\{L^{2}(0)\}$. Thus, a crucial
problem in applying the ridge-regression method is the choice of a flattening factor
k. Hoerl, Kennard, and Baldwin (1975) studied the characteristics of the estimator
obtained by substituting in (5.5.3) an estimate of the optimal $k^{0}(\gamma)$. They considered
the estimator

$$
\hat k = \frac{p\hat\sigma^{2}}{\|\hat\beta\|^{2}},
\tag{5.5.5}
$$

where $\hat\beta$ is the LSE and $\hat\sigma^{2}$ is the estimate of the variance around the regression line,
as in (5.4.11). The estimator $\beta^{*}(\hat k)$ is not linear in X, since $\hat k$ is a nonlinear function
of X. Most of the results proven for a fixed value of k do not necessarily hold when
k is random, as in (5.5.5). For this reason Hoerl, Kennard, and Baldwin performed
extensive simulation experiments to obtain estimates of the important characteristics
of $\beta^{*}(\hat k)$. They found that with probability greater than 0.5 the ridge-type estimator
$\beta^{*}(\hat k)$ is closer (has smaller distance norm) to the true $\beta$ than the LSE. Moreover,
this probability increases as the dimension p of the factor space increases and as the
spread of the eigenvalues of S increases. The ridge-type estimators $\beta^{*}(\hat k)$ are similar to
other types of nonlinear estimators (James–Stein, Bayes, and other types) designed
to reduce the MSE. These are discussed in Chapter 8.
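
The behavior of the total MSE as a function of k can be explored by evaluating (5.5.4) on a grid. The following sketch is an added illustration; the eigenvalues, the vector $\gamma$, and $\sigma^2$ are hypothetical values chosen to mimic an ill-conditioned design, and the grid search stands in for an exact minimization.

```python
def ridge_mse(k, eigenvalues, gamma, sigma2):
    """Total MSE (5.5.4) of the ridge estimator for a fixed k >= 0."""
    variance = sigma2 * sum(l / (l + k) ** 2 for l in eigenvalues)
    bias_sq = k * k * sum(g * g / (l + k) ** 2 for l, g in zip(eigenvalues, gamma))
    return variance + bias_sq


# Hypothetical spectrum with one eigenvalue close to zero.
eigenvalues = [10.0, 1.0, 0.01]
gamma = [1.0, 1.0, 1.0]
sigma2 = 1.0

mse_lse = ridge_mse(0.0, eigenvalues, gamma, sigma2)     # equals sigma^2 * sum(1/lambda_i)
grid = [i / 1000 for i in range(0, 5001)]                # k in [0, 5]
best_k = min(grid, key=lambda k: ridge_mse(k, eigenvalues, gamma, sigma2))
print(mse_lse, best_k, ridge_mse(best_k, eigenvalues, gamma, sigma2))
```

With these values the MSE at k = 0 is dominated by the smallest eigenvalue, and a moderate positive k reduces it substantially. In practice $\gamma$ is unknown, which is why the text turns to data-based choices such as (5.5.5).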

A more general class of ridge-type estimators called the generalized ridge regres- 
sion estimators is given by 


$$
\hat\beta_C = (A'A + C)^{-1}A'X,
\tag{5.5.6}
$$


where C is a positive definite matrix chosen so that A’A + C is nonsingular. [The 
class is actually defined also for A'A + C singular with a Moore—Penrose generalized 
inverse replacing (A’A + C)~!; see Marquardt (1970).] 


5.6 MAXIMUM LIKELIHOOD ESTIMATORS 


5.6.1 Definition and Examples 


In Section 3.3, we introduced the notion of the likelihood function, L(6; x) defined 
over a parameter space ©, and studied some of its properties. We develop here an 
estimation theory based on the likelihood function. 

The maximum likelihood estimator (MLE) of $\theta$ is a value of $\theta$ at which the
likelihood function $L(\theta;x)$ attains its supremum (or maximum). We remark that if
the family $\mathcal{F}$ admits a nontrivial sufficient statistic $T(X)$ then the MLE is a function of
$T(X)$. This is implied immediately from the Neyman–Fisher Factorization Theorem.
Indeed, in this case,

$$
f(x;\theta) = h(x)\,g(T(x);\theta),
$$

where $h(x) > 0$ with probability one. Hence, the kernel of the likelihood function
can be written as $L^{*}(\theta;x) = g(T(x);\theta)$. Accordingly, the value $\theta$ that maximizes it
depends on $T(x)$. We also notice that although the MLE is a function of the sufficient
statistic, the converse is not always true. An MLE is not necessarily a sufficient
statistic.


5.6.2. MLEs in Exponential Type Families 


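Let $X_1,\ldots,X_n$ be i.i.d. random variables having a k-parameter exponential type
family, with a p.d.f. of the form (2.16.2). The likelihood function of the natural
parameters is

$$
L(\psi; X) = \exp\left\{\sum_{i=1}^{k}\psi_i T_i(X) - nK(\psi)\right\},
\tag{5.6.1}
$$

where

$$
T_i(X) = \sum_{j=1}^{n} U_i(X_j), \quad i = 1,\ldots,k.
$$

The MLEs of $\psi_1,\ldots,\psi_k$ are obtained by solving the system of k equations

$$
\begin{aligned}
\frac{\partial}{\partial\psi_1}K(\psi) &= \frac{1}{n}\sum_{j=1}^{n} U_1(X_j), \\
&\ \ \vdots \\
\frac{\partial}{\partial\psi_k}K(\psi) &= \frac{1}{n}\sum_{j=1}^{n} U_k(X_j).
\end{aligned}
\tag{5.6.2}
$$

Note that whenever the expectations exist, $E_\psi\{U_i(X)\} = \partial K(\psi)/\partial\psi_i$ for each
$i = 1,\ldots,k$. Hence, if $X_1,\ldots,X_n$ are i.i.d.,
$E_\psi\left\{\frac{\partial}{\partial\psi_i}K(\hat\psi)\right\} = \partial K(\psi)/\partial\psi_i$, for
each $i = 1,\ldots,k$, where $\hat\psi$ is the vector of MLEs. For all points $\psi$ in the interior of
the parameter space $\Omega$, the matrix
$\left(\frac{\partial^{2}}{\partial\psi_i\,\partial\psi_j}K(\psi);\ i, j = 1,\ldots,k\right)$ exists and is
positive definite for all $\psi$ since $K(\psi)$ is convex. Thus, the root of (5.6.2) is unique
and is a m.s.s.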


5.6.3 The Invariance Principle 


If the vector $\theta = (\theta_1,\ldots,\theta_k)$ is reparametrized by a one-to-one transformation
$\psi_1 = g_1(\theta),\ldots,\psi_k = g_k(\theta)$ then the MLEs of $\psi_i$ are obtained by substituting in the
g-functions the MLEs of $\theta$. This is obviously true when the transformation $\theta \to \psi$
is one-to-one. Indeed, if $\theta_1 = g_1^{-1}(\psi),\ldots,\theta_k = g_k^{-1}(\psi)$ then the likelihood function
$L(\theta;x)$ can be expressed as a function of $\psi$, $L(g_1^{-1}(\psi),\ldots,g_k^{-1}(\psi);x)$. If $(\hat\theta_1,\ldots,\hat\theta_k)$
is a point at which $L(\theta;x)$ attains its supremum, and if $\hat\psi = (g_1(\hat\theta),\ldots,g_k(\hat\theta))$ then,
since the transformation is one-to-one,

$$
\sup_{\theta} L(\theta;x) = L(\hat\theta;x) = L(g_1^{-1}(\hat\psi),\ldots,g_k^{-1}(\hat\psi);x)
= L^{*}(\hat\psi;x) = \sup_{\psi} L^{*}(\psi;x),
\tag{5.6.3}
$$

where $L^{*}(\psi;x)$ is the likelihood, as a function of $\psi$. This result can be extended to
general transformations, not necessarily one-to-one, by a proper redefinition of the
concept of MLE over the space of the $\psi$-values. Let $\psi = g(\theta)$ be a vector valued
function of $\theta$; i.e., $\psi = g(\theta) = (g_1(\theta),\ldots,g_r(\theta))$ where the dimension of $g(\theta)$, r,
does not exceed that of $\theta$, k.

Following Zehna (1966), we introduce the notion of the profile likelihood function
of $\psi = (\psi_1,\ldots,\psi_r)$. Define the cosets of $\theta$-values

$$
G(\psi) = \{\theta;\ g(\theta) = \psi\},
\tag{5.6.4}
$$

and let $L(\theta;x)$ be the likelihood function of $\theta$ given x. The profile likelihood of $\psi$
given x is defined as

$$
L^{*}(\psi;x) = \sup_{\theta\in G(\psi)} L(\theta;x).
\tag{5.6.5}
$$

Obviously, in the one-to-one case $L^{*}(\psi;x) = L(g_1^{-1}(\psi),\ldots,g_k^{-1}(\psi);x)$. Generally,
we define the MLE of $\psi$ to be the value at which $L^{*}(\psi;x)$ attains its supremum. It is
easy then to prove that if $\hat\theta$ is an MLE of $\theta$ and $\hat\psi = g(\hat\theta)$, then $\hat\psi$ is an MLE of $\psi$,
i.e.,

$$
\sup_{\psi} L^{*}(\psi;x) = L^{*}(\hat\psi;x).
\tag{5.6.6}
$$
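
A brief worked instance of the invariance principle, added as an illustration and not taken from the text: assume $X_1,\ldots,X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$, and let $\psi = g(\theta) = P_\theta\{X \le c\}$ for a fixed constant c. This transformation is not one-to-one (r = 1 < k = 2), so the profile likelihood definition applies.

```latex
\[
\hat\theta = \left(\bar X,\ \hat\sigma^{2}\right), \qquad
\hat\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^{2},
\]
\[
\psi = g(\theta) = \Phi\left(\frac{c - \mu}{\sigma}\right)
\quad\Longrightarrow\quad
\hat\psi = g(\hat\theta) = \Phi\left(\frac{c - \bar X}{\hat\sigma}\right).
\]
```

By (5.6.6), $\hat\psi$ is an MLE of $\psi$ without any further maximization.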


5.6.4 MLE of the Parameters of Tolerance Distributions 


Suppose that k independent experiments are performed at controllable real-valued
experimental levels (dosages) $-\infty < x_1 < \ldots < x_k < \infty$. At each of these levels
$n_j$ Bernoulli trials are performed $(j = 1,\ldots,k)$. The success probabilities of these
Bernoulli trials are increasing functions $F(x)$ of x. These functions, called tolerance
distributions, are the expected proportion of (individuals) units in a population whose
tolerance against the applied dosage does not exceed the level x. The model thus
consists of k independent random variables $J_1,\ldots,J_k$ such that $J_i \sim B(n_i, F(x_i;\theta))$,
$i = 1,\ldots,k$, where $\theta = (\theta_1,\ldots,\theta_r)$, $1 \le r < k$, is a vector of unknown parameters.
The problem is to estimate $\theta$. Frequently applied models are

$$
F(x;\theta) = \begin{cases}
\Phi(\alpha + \beta x), & \text{normal distributions;} \\
(1 + \exp\{-(\alpha + \beta x)\})^{-1}, & \text{logistic distributions;} \\
\exp\{-\exp\{-(\alpha + \beta x)\}\}, & \text{extreme-value distribution.}
\end{cases}
\tag{5.6.7}
$$

We remark that in some of the modern literature the tolerance distributions are called
link functions (see Lindsey, 1996). Generally, if $F(\alpha + \beta x_i)$ is the success probability
at level $x_i$, the likelihood function of $(\alpha, \beta)$, given $J_1,\ldots,J_k$ and $x_1,\ldots,x_k$,
$n_1,\ldots,n_k$, is

$$
L(\alpha, \beta \mid J, x, n) = \prod_{i=1}^{k}\left(\frac{F(\alpha + \beta x_i)}{1 - F(\alpha + \beta x_i)}\right)^{J_i}
\cdot\prod_{i=1}^{k}[1 - F(\alpha + \beta x_i)]^{n_i},
\tag{5.6.8}
$$

and the log-likelihood function is

$$
\log L(\alpha, \beta \mid J, x, n) = \sum_{i=1}^{k} J_i\log\frac{F(\alpha + \beta x_i)}{1 - F(\alpha + \beta x_i)}
+ \sum_{i=1}^{k} n_i\log(1 - F(\alpha + \beta x_i)).
$$

The MLEs of $\alpha$ and $\beta$ are the roots of the nonlinear equations

$$
\begin{aligned}
\sum_{j=1}^{k} J_j\,\frac{f(\alpha + \beta x_j)}{F(\alpha + \beta x_j)\bar F(\alpha + \beta x_j)}
&= \sum_{j=1}^{k} n_j\,\frac{f(\alpha + \beta x_j)}{\bar F(\alpha + \beta x_j)}, \\
\sum_{j=1}^{k} x_j J_j\,\frac{f(\alpha + \beta x_j)}{F(\alpha + \beta x_j)\bar F(\alpha + \beta x_j)}
&= \sum_{j=1}^{k} n_j x_j\,\frac{f(\alpha + \beta x_j)}{\bar F(\alpha + \beta x_j)},
\end{aligned}
\tag{5.6.9}
$$

where $f(z) = F'(z)$ is the p.d.f. of the standardized distribution $F(z)$ and $\bar F(z) =
1 - F(z)$.
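Let $\hat p_i = J_i/n_i$, $i = 1,\ldots,k$, and define the function

$$
G(z; \hat p) = \frac{f(z)(\hat p - F(z))}{F(z)\bar F(z)}, \quad -\infty < z < \infty.
\tag{5.6.10}
$$

Accordingly, the MLEs of $\alpha$ and $\beta$ are the roots $\hat\alpha$ and $\hat\beta$ of the equations

$$
\sum_{i=1}^{k} n_i\,G(\hat\alpha + \hat\beta x_i; \hat p_i) = 0
\tag{5.6.11}
$$

and

$$
\sum_{i=1}^{k} x_i n_i\,G(\hat\alpha + \hat\beta x_i; \hat p_i) = 0.
$$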


The solution of this system of (generally nonlinear) equations according to the
Newton–Raphson method proceeds as follows. Let $\hat\alpha_0$ and $\hat\beta_0$ be an initial solution.
The adjustment after the jth iteration $(j = 0, 1,\ldots)$ is $\hat\alpha_{j+1} = \hat\alpha_j + \delta\alpha_j$ and
$\hat\beta_{j+1} = \hat\beta_j + \delta\beta_j$, where $\delta\alpha_j$ and $\delta\beta_j$ are solutions of the linear equations

$$
\begin{pmatrix}
\sum_{i=1}^{k} W_i^{(j)} & \sum_{i=1}^{k} x_i W_i^{(j)} \\
\sum_{i=1}^{k} x_i W_i^{(j)} & \sum_{i=1}^{k} x_i^{2} W_i^{(j)}
\end{pmatrix}
\begin{pmatrix}\delta\alpha_j \\ \delta\beta_j\end{pmatrix}
=
\begin{pmatrix}\sum_{i=1}^{k} Y_i^{(j)} \\ \sum_{i=1}^{k} x_i Y_i^{(j)}\end{pmatrix},
\tag{5.6.12}
$$

where

$$
W_i^{(j)} = n_i\,G'(\hat\alpha_j + \hat\beta_j x_i; \hat p_i)
\tag{5.6.13}
$$

and

$$
Y_i^{(j)} = -n_i\,G(\hat\alpha_j + \hat\beta_j x_i; \hat p_i)
$$

and $G'(z; \hat p) = \frac{\partial}{\partial z}G(z; \hat p)$. The linear equations (5.6.12) resemble the normal
equations in weighted least-squares estimation. However, in the present problems the
weights depend on the unknown parameters $\alpha$ and $\beta$. In each iteration, the current
estimates of $\alpha$ and $\beta$ are substituted. For applications of this procedure in statistical
reliability and bioassay quantal response analysis, see Finney (1964), Gross and
Clark (1975), and Zacks (1997).
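
The iteration (5.6.12) is easy to sketch for the logistic tolerance distribution. For that choice $f(z) = F(z)\bar F(z)$, so that (5.6.10) reduces to $G(z;\hat p) = \hat p - F(z)$ and $G'(z;\hat p) = -f(z)$. The code below is an added illustration under this assumption; the dosage levels, numbers of trials, and success counts are made-up data, and the function names, starting values, and tolerance are arbitrary choices.

```python
from math import exp


def logistic(z):
    """Standard logistic c.d.f."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic_tolerance(x, n, J, alpha=0.0, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations (5.6.12) for the model F(alpha + beta * x),
    specialized to the logistic F, where G(z; p) = p - F(z) and G'(z; p) = -f(z)."""
    p_hat = [j / m for j, m in zip(J, n)]
    for _ in range(max_iter):
        F = [logistic(alpha + beta * xi) for xi in x]
        W = [-m * Fi * (1.0 - Fi) for m, Fi in zip(n, F)]       # W_i = n_i G'(z_i)
        Y = [-m * (p - Fi) for m, p, Fi in zip(n, p_hat, F)]    # Y_i = -n_i G(z_i)
        sw = sum(W)
        sxw = sum(xi * wi for xi, wi in zip(x, W))
        sxxw = sum(xi * xi * wi for xi, wi in zip(x, W))
        sy = sum(Y)
        sxy = sum(xi * yi for xi, yi in zip(x, Y))
        det = sw * sxxw - sxw * sxw
        d_alpha = (sxxw * sy - sxw * sxy) / det                 # solve the 2 x 2 system
        d_beta = (sw * sxy - sxw * sy) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < tol:
            break
    return alpha, beta


# Hypothetical quantal response data: dosage, trials, successes.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
n = [20, 20, 20, 20, 20]
J = [1, 4, 9, 15, 19]

alpha_hat, beta_hat = fit_logistic_tolerance(x, n, J)
print(alpha_hat, beta_hat)
```

At convergence, the two likelihood equations (5.6.11) are satisfied to within numerical tolerance. For other tolerance distributions only the expressions for `W` and `Y` would change.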


5.7 EQUIVARIANT ESTIMATORS 


5.7.1 The Structure of Equivariant Estimators 


Certain families of distributions have structural properties that are preserved under
transformations of the random variables. For example, if X has an absolutely
continuous distribution belonging to a family $\mathcal{F}$ which depends on location and scale
parameters, i.e., its p.d.f. is $f(x;\mu,\sigma) = \frac{1}{\sigma}\phi\left(\frac{x-\mu}{\sigma}\right)$, where $-\infty < \mu < \infty$ and
$0 < \sigma < \infty$, then any real-affine transformation of X, given by

$$
[\alpha, \beta]X = \alpha + \beta X, \quad -\infty < \alpha < \infty, \quad 0 < \beta < \infty
$$

yields a random variable $Y = \alpha + \beta X$ with p.d.f. $f(y;\bar\mu,\bar\sigma) = \frac{1}{\bar\sigma}\phi\left(\frac{y-\bar\mu}{\bar\sigma}\right)$, where
$\bar\mu = \alpha + \beta\mu$ and $\bar\sigma = \beta\sigma$. Thus, the distribution of Y belongs to the same family $\mathcal{F}$.
The family $\mathcal{F}$ is preserved under transformations belonging to the group
$\mathcal{G} = \{[\alpha, \beta];\ -\infty < \alpha < \infty,\ 0 < \beta < \infty\}$ of real-affine transformations.

In this section, we present the elements of the theory of families of distributions and corresponding estimators having structural properties that are preserved under certain groups of transformations. For a comprehensive treatment of the theory and its geometrical interpretation, see the book of Fraser (1968). Advanced treatment of the subject can be found in Berk (1967), Hall, Wijsman, and Ghosh (1965), Wijsman (1990), and Eaton (1989). We require that every element $g$ of $\mathcal{G}$ be a one-to-one transformation of $\mathcal{X}$ onto $\mathcal{X}$. Accordingly, the sample space structure does not change under these transformations. Moreover, if $\mathcal{B}$ is the Borel $\sigma$-field on $\mathcal{X}$ then, for all $g \in \mathcal{G}$, we require that $P_\theta[gB]$ be well defined for all $B \in \mathcal{B}$ and $\theta \in \Theta$. Furthermore, as seen in the above example of the location and scale parameter distributions, if $\theta$ is a parameter of the distribution of $X$, the parameter of $Y = gX$ is $\bar g\theta$, where $\bar g$ is a transformation on the parameter space $\Theta$ defined by the relationship

\[ P_\theta[B] = P_{\bar g\theta}[gB], \quad \text{for every } B \in \mathcal{B}. \tag{5.7.1} \]


In the example of real-affine transformations, if $g = [\alpha, \beta]$ and $\theta = (\mu, \sigma)$, then $\bar g(\mu, \sigma) = (\alpha + \beta\mu, \beta\sigma)$. We note that $\bar g\Theta = \Theta$ for every $\bar g$ corresponding to $g$ in $\mathcal{G}$. Suppose that $X_1, \ldots, X_n$ are i.i.d. random variables whose distribution $F$ belongs to a family $\mathcal{F}$ that is preserved under transformations belonging to a group $\mathcal{G}$. If $T(X_1, \ldots, X_n)$ is a statistic, then we define the transformations $\tilde g$ on the range $\mathcal{T}$ of $T(X_1, \ldots, X_n)$, corresponding to transformations $g$ of $\mathcal{G}$, by

\[ \tilde g\, T(X_1, \ldots, X_n) = T(gX_1, \ldots, gX_n). \tag{5.7.2} \]

A statistic $S(X_1, \ldots, X_n)$ is called invariant with respect to $\mathcal{G}$ if

\[ \tilde g\, S(X_1, \ldots, X_n) = S(X_1, \ldots, X_n) \quad \text{for every } g \in \mathcal{G}. \tag{5.7.3} \]


A coset of $x^0$ with respect to $\mathcal{G}$ is the set of all points that can be obtained as images of $x^0$, i.e.,

\[ C(x^0) = \{x : x = gx^0, \ g \in \mathcal{G}\}. \]

Such a coset is also called an orbit of $\mathcal{G}$ in $\mathcal{X}$ through $x^0$. If $\mathbf{x}^0 = (x_1^0, \ldots, x_n^0)$ is a given vector, the orbit of $\mathcal{G}$ in $\mathcal{X}^{(n)}$ through $\mathbf{x}^0$ is the coset

\[ C(\mathbf{x}^0) = \{\mathbf{x} : \mathbf{x} = (gx_1^0, \ldots, gx_n^0), \ g \in \mathcal{G}\}. \]

If $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ belong to the same orbit and $S(\mathbf{x}) = S(x_1, \ldots, x_n)$ is invariant with respect to $\mathcal{G}$ then $S(\mathbf{x}^{(1)}) = S(\mathbf{x}^{(2)})$. A statistic $U(\mathbf{X}) = U(X_1, \ldots, X_n)$ is called maximal invariant if it is invariant and if $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ belong to two different orbits then $U(\mathbf{X}^{(1)}) \neq U(\mathbf{X}^{(2)})$. Every invariant statistic is a function of a maximal invariant statistic.

If $\hat\theta(X_1, \ldots, X_n)$ is an estimator of $\theta$, it is often desirable that the estimator react to transformations of $\mathcal{G}$ in the same manner as the parameter $\theta$ does, i.e.,

\[ \hat\theta(g\mathbf{x}) = \bar g\,\hat\theta(\mathbf{x}), \quad \text{for every } g \in \mathcal{G}. \tag{5.7.4} \]
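The two notions can be checked numerically for the real-affine group. The following is a minimal sketch, assuming the sample mean and sample standard deviation as estimators of $(\mu, \sigma)$; the helper names `affine` and `standardized_residuals` and the data are illustrative only. The estimators should satisfy (5.7.4), while the standardized residuals should be invariant in the sense of (5.7.3).

```python
from statistics import mean, stdev


def affine(xs, alpha, beta):
    """Apply the real-affine map g = [alpha, beta] to every observation."""
    return [alpha + beta * x for x in xs]


def standardized_residuals(xs):
    """(x_i - mean) / sd: unchanged by any g = [alpha, beta] with beta > 0."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


x = [2.1, 3.4, 1.7, 5.0, 4.2]   # illustrative data
alpha, beta = -3.0, 2.5         # an arbitrary element of the group
y = affine(x, alpha, beta)

# Equivariance (5.7.4): the estimates move exactly like (mu, sigma).
mu_hat_x, sigma_hat_x = mean(x), stdev(x)
mu_hat_y, sigma_hat_y = mean(y), stdev(y)
print(mu_hat_y, alpha + beta * mu_hat_x)
print(sigma_hat_y, beta * sigma_hat_x)

# Invariance (5.7.3): the residual vector is the same on the whole orbit.
print(standardized_residuals(x))
print(standardized_residuals(y))
```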


5.7.2 Minimum MSE Equivariant Estimators


Estimators satisfying (5.7.4) are called equivariant. The objective is to derive an equivariant estimator having a minimum MSE or another optimal property. The algebraic structure of the problem often allows us to search for such optimal estimators in a systematic manner.


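5.7.3 Minimum Risk Equivariant Estimators


A loss function $L(\hat\theta(X), \theta)$ is called invariant under $\mathcal{G}$ if

\[ L(\bar g\,\hat\theta(X), \bar g\theta) = L(\hat\theta(X), \theta), \tag{5.7.5} \]

for all $\theta \in \Theta$ and all $g \in \mathcal{G}$.

The coset $C(\theta_0) = \{\theta; \theta = \bar g\theta_0, g \in \mathcal{G}\}$ is called an orbit of $\bar{\mathcal{G}}$ through $\theta_0$ in $\Theta$. We show now that if $\hat\theta(X)$ is an equivariant estimator and $L(\hat\theta(X), \theta)$ is an invariant loss function then the risk function $R(\hat\theta, \theta) = E_\theta\{L(\hat\theta(X), \theta)\}$ is constant on each orbit of $\bar{\mathcal{G}}$ in $\Theta$. Indeed, for any $g \in \mathcal{G}$, if the distribution of $X$ is $F(x; \theta)$ and the distribution of $Y = gX$ is $F(y; \bar g\theta)$, then if $\hat\theta$ is equivariant

\[
\begin{aligned}
R(\hat\theta, \theta) &= \int L(\hat\theta(x), \theta)\, dF(x; \theta) \\
&= \int L(\bar g\,\hat\theta(x), \bar g\theta)\, dF(x; \theta) \\
&= \int L(\hat\theta(gx), \bar g\theta)\, dF(x; \theta) \\
&= \int L(\hat\theta(y), \bar g\theta)\, dF(y; \bar g\theta) \\
&= R(\hat\theta, \bar g\theta), \quad \text{for all } g \in \mathcal{G}.
\end{aligned}
\tag{5.7.6}
\]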


Thus, whenever the structure of the model is such that $\Theta$ contains only one orbit with respect to $\bar{\mathcal{G}}$, and there exist equivariant estimators with finite risk, then each such equivariant estimator has a constant risk function. In Example 5.23, we illustrate such cases. We consider there the location and scale parameter family of the normal distributions $N(\mu, \sigma^2)$. This family has a parameter space $\Theta$, which has only one orbit with respect to the group of real-affine transformations. If the parameter space has various orbits, as in the case of Example 5.24, there is no global uniformly minimum risk equivariant estimator, but only locally for each orbit. In Example 5.26, we construct uniformly minimum risk equivariant estimators of the scale and shape parameters of Weibull distributions for a group of transformations and a corresponding invariant loss function.


5.7.4 The Pitman Estimators 


We develop here the minimum MSE equivariant estimators for the special models 
of location parameters and location and scale parameters. These estimators are 
called the Pitman estimators. 

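Consider first the family $\mathcal{F}$ of location parameter distributions, i.e., every p.d.f. of $\mathcal{F}$ is given by $f(x; \theta) = \phi(x - \theta)$, $-\infty < \theta < \infty$, where $\phi(x)$ is the standard p.d.f. According to our previous discussion, we consider the group $\mathcal{G}$ of real translations. Let $\hat\theta(\mathbf{X})$ be an equivariant estimator of $\theta$. Then, writing $T = (\hat\theta, X_{(1)} - \hat\theta, \ldots, X_{(n)} - \hat\theta)$, where $X_{(1)} \le \ldots \le X_{(n)}$, for any equivariant estimator, $d(\mathbf{X})$, of $\theta$, we have

\[ d(\mathbf{X}) = \hat\theta + \psi(X_{(1)} - \hat\theta, \ldots, X_{(n)} - \hat\theta). \]

Note that $U = (X_{(1)} - \hat\theta, \ldots, X_{(n)} - \hat\theta)$ has a distribution that does not depend on $\theta$. Moreover, since $\hat\theta(\mathbf{X})$ is an equivariant estimator, we can write

\[ \hat\theta(\mathbf{X}) = \theta + \hat\theta(\mathbf{Y}), \quad \text{where } \mathbf{Y} = \mathbf{X} - \theta\mathbf{1}. \]

Thus, the MSE of $d(\mathbf{X})$ is

\[ \mathrm{MSE}\{d\} = E\{[\hat\theta(\mathbf{Y}) + \psi(X_{(1)} - \hat\theta, \ldots, X_{(n)} - \hat\theta)]^2\}. \tag{5.7.7} \]

It follows immediately that the function $\psi(U)$ which minimizes the MSE is the conditional expectation

\[ \psi^0(U) = -E\{\hat\theta(\mathbf{Y}) \mid U\}. \tag{5.7.8} \]

Thus, the minimum MSE equivariant estimator is

\[ d^0(\mathbf{X}) = \hat\theta(\mathbf{X}) - E\{\hat\theta(\mathbf{Y}) \mid U\}. \tag{5.7.9} \]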


This is a generalized form of the Pitman estimator. The well-known specific form of the Pitman estimator is obtained by starting with $\hat\theta(\mathbf{X}) = X_{(1)}$. In this case, $\hat\theta(\mathbf{Y}) = Y_{(1)}$, where $Y_{(1)}$ is the minimum of a sample from the standard distribution. Formula (5.7.9) is then reduced to the special form

\[
\begin{aligned}
d^0(\mathbf{X}) &= X_{(1)} - E\{Y_{(1)} \mid X_{(2)} - X_{(1)}, \ldots, X_{(n)} - X_{(1)}\} \\
&= X_{(1)} - \frac{\int_{-\infty}^{\infty} u\,\phi(u) \prod_{i=2}^{n} \phi(Y_{(i)} + u)\, du}{\int_{-\infty}^{\infty} \phi(u) \prod_{i=2}^{n} \phi(Y_{(i)} + u)\, du},
\end{aligned}
\tag{5.7.10}
\]

where $Y_{(i)} = X_{(i)} - X_{(1)}$, $i = 2, \ldots, n$. In the derivation of (5.7.9), we have assumed that the MSE of $d(\mathbf{X})$ exists. A minimum risk equivariant estimator may not exist. Finally, we mention that the minimum MSE equivariant estimators are unbiased. Indeed,

\[ E_\theta\{d^0(\mathbf{X})\} = \theta + E\{\hat\theta(\mathbf{Y}) - E\{\hat\theta(\mathbf{Y}) \mid U\}\} = \theta, \quad -\infty < \theta < \infty. \tag{5.7.11} \]
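The ratio of integrals in (5.7.10) is easy to evaluate numerically. The sketch below is one possible approach, not a prescribed algorithm: it replaces both integrals by Riemann sums on a uniform grid (the grid spacing cancels in the ratio) and works with logarithms to avoid underflow. The function name `pitman_location`, the grid settings, and the data are assumptions made for illustration; `log_phi` is the logarithm of the standard p.d.f. up to an additive constant.

```python
import math


def pitman_location(x, log_phi, half_width=12.0, num=4001):
    """Evaluate the Pitman location estimator (5.7.10) on a uniform grid."""
    xs = sorted(x)
    d = [v - xs[0] for v in xs]          # d[0] = 0, d[i-1] = X_(i) - X_(1)
    lo, hi = -d[-1] - half_width, half_width
    step = (hi - lo) / (num - 1)
    grid = [lo + j * step for j in range(num)]
    # log of phi(u) * prod_{i>=2} phi(u + Y_(i)); the d[0] = 0 term supplies phi(u)
    log_w = [sum(log_phi(u + di) for di in d) for u in grid]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    ratio = sum(u * wi for u, wi in zip(grid, w)) / sum(w)
    return xs[0] - ratio


def log_normal_pdf(z):
    return -0.5 * z * z      # standard normal, constant dropped


def log_laplace_pdf(z):
    return -abs(z)           # standard Laplace, constant dropped


x = [0.3, -1.2, 2.5, 0.9, 1.1]           # illustrative data
print(pitman_location(x, log_normal_pdf), sum(x) / len(x))
print(pitman_location(x, log_laplace_pdf))
```

For the normal family the Pitman estimator coincides with the sample mean, which gives a convenient check of the numerical integration. Shifting all observations by a constant shifts the estimate by the same constant, as equivariance requires.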


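If $\mathcal{F}$ is a scale and location family of distributions, with p.d.f.s of the form

\[ f(x; \mu, \sigma) = \frac{1}{\sigma}\phi\left(\frac{x - \mu}{\sigma}\right), \quad -\infty < x < \infty, \quad -\infty < \mu < \infty, \quad 0 < \sigma < \infty, \]

where $\phi(u)$ is a p.d.f., then every equivariant estimator of $\mu$ with respect to the group $\mathcal{G}$ of real-affine transformations can be expressed in the form

\[ \hat\mu(\mathbf{X}) = X_{(1)} + (X_{(2)} - X_{(1)})\,\psi(\mathbf{Z}), \tag{5.7.12} \]

where $X_{(1)} \le \ldots \le X_{(n)}$ is the order statistic, $X_{(2)} - X_{(1)} > 0$ and $\mathbf{Z} = (Z_3, \ldots, Z_n)'$, with $Z_i = (X_{(i)} - X_{(1)})/(X_{(2)} - X_{(1)})$. The MSE of $\hat\mu(\mathbf{X})$ is given by

\[
\begin{aligned}
\mathrm{MSE}\{\hat\mu(\mathbf{X}); \mu, \sigma\} &= \sigma^2 E_0\{[X_{(1)} + (X_{(2)} - X_{(1)})\,\psi(\mathbf{Z})]^2\} \\
&= \sigma^2 E_0\{E_0\{[X_{(1)} + (X_{(2)} - X_{(1)})\,\psi(\mathbf{Z})]^2 \mid \mathbf{Z}\}\},
\end{aligned}
\tag{5.7.13}
\]

where $E_0\{\cdot\}$ designates an expectation with respect to the standard distribution ($\mu = 0$, $\sigma = 1$). An optimal choice of $\psi(\mathbf{Z})$ is one for which $E_0\{[X_{(1)} + (X_{(2)} - X_{(1)})\,\psi(\mathbf{Z})]^2 \mid \mathbf{Z}\}$ is minimal. Thus, the minimum MSE equivariant estimator of $\mu$ is

\[ \hat\mu^0(\mathbf{X}) = X_{(1)} + (X_{(2)} - X_{(1)})\,\psi^0(\mathbf{Z}), \tag{5.7.14} \]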


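where

\[ \psi^0(\mathbf{Z}) = -\frac{E_0\{X_{(1)}(X_{(2)} - X_{(1)}) \mid \mathbf{Z}\}}{E_0\{(X_{(2)} - X_{(1)})^2 \mid \mathbf{Z}\}}. \tag{5.7.15} \]

Equivalently, the Pitman estimator of the location parameter is expressed as

\[ \hat\mu^0(\mathbf{X}) = X_{(1)} - (X_{(2)} - X_{(1)}) \cdot \frac{\int_{-\infty}^{\infty} u\,\phi(u) \int_0^{\infty} v^{n-1} \phi(u + v) \prod_{i=3}^{n} \phi(u + vZ_i)\, dv\, du}{\int_{-\infty}^{\infty} \phi(u) \int_0^{\infty} v^{n} \phi(u + v) \prod_{i=3}^{n} \phi(u + vZ_i)\, dv\, du}. \tag{5.7.16} \]

In a similar manner, we show that the minimum MSE equivariant estimator for $\sigma$ is $\hat\sigma^0(\mathbf{X}) = (X_{(2)} - X_{(1)})\,\psi^0(Z_3, \ldots, Z_n)$, where

\[ \psi^0(Z_3, \ldots, Z_n) = \frac{E_0\{U_2 \mid Z_3, \ldots, Z_n\}}{E_0\{U_2^2 \mid Z_3, \ldots, Z_n\}}, \tag{5.7.17} \]

and $U_2 = X_{(2)} - X_{(1)}$ under the standard distribution. Indeed, $\psi^0(\mathbf{Z})$ minimizes $E_0\{(U_2\psi(\mathbf{Z}) - 1)^2 \mid \mathbf{Z}\}$. Accordingly, the Pitman estimator of the scale parameter, $\sigma$, is

\[ \hat\sigma^0(\mathbf{X}) = (X_{(2)} - X_{(1)}) \cdot \frac{\int_{-\infty}^{\infty} \phi(u_1) \int_0^{\infty} u_2^{n-1} \phi(u_1 + u_2) \prod_{i=3}^{n} \phi(u_1 + u_2 Z_i)\, du_2\, du_1}{\int_{-\infty}^{\infty} \phi(u_1) \int_0^{\infty} u_2^{n} \phi(u_1 + u_2) \prod_{i=3}^{n} \phi(u_1 + u_2 Z_i)\, du_2\, du_1}. \tag{5.7.18} \]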
5.8 ESTIMATING EQUATIONS


5.8.1 Moment-Equations Estimators


Suppose that $\mathcal{F}$ is a family of distributions depending on $k$ real parameters, $\theta_1, \ldots, \theta_k$, $1 \le k$. Suppose that the moments $\mu_r$, $1 \le r \le k$, exist and are given by some specified functions

\[ \mu_r = M_r(\theta_1, \ldots, \theta_k), \quad 1 \le r \le k. \]

If $X_1, \ldots, X_n$ are i.i.d. random variables having a distribution in $\mathcal{F}$, the sample moments $M_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r$ are unbiased estimators of $\mu_r$ ($1 \le r \le k$) and by the laws of large numbers (see Section 1.11) they converge almost surely to $\mu_r$ as $n \to \infty$. The roots of the system of equations

\[ M_r = M_r(\hat\theta_1, \ldots, \hat\theta_k), \quad 1 \le r \le k, \tag{5.8.1} \]


are called the moment-equations estimators (MEEs) of $\theta_1, \ldots, \theta_k$.

In Examples 5.28-5.29, we discuss cases where both the MLE and the MEE can be easily determined, but the MLE exhibits better characteristics. The question is then, why should we consider the MEEs at all? The reasons for considering MEEs are as follows:

1. By using the method of moment equations one can often easily determine consistent estimators having asymptotically normal distributions. These notions of consistency and asymptotic normality are defined and discussed in Chapter 7.

2. There are cases in which it is difficult to determine the MLEs, while the MEEs can be readily determined, and can be used as a first approximation in an iterative algorithm.

3. There are cases in which MLEs do not exist, while MEEs do exist.
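As an illustration of (5.8.1) with $k = 2$, consider a gamma family. The sketch below assumes a shape-scale parametrization (shape $\nu$, scale $\beta$, chosen here for convenience and not necessarily the parametrization used elsewhere in the text), for which $\mu_1 = \nu\beta$ and $\mu_2 = \nu(\nu + 1)\beta^2$. Solving the two moment equations gives $\hat\beta = (M_2 - M_1^2)/M_1$ and $\hat\nu = M_1/\hat\beta$. The function name `gamma_mee` is hypothetical.

```python
import random


def gamma_mee(xs):
    """Moment-equations estimators of (shape, scale) for a gamma sample."""
    n = len(xs)
    m1 = sum(xs) / n                     # first sample moment
    m2 = sum(v * v for v in xs) / n      # second sample moment
    scale = (m2 - m1 * m1) / m1          # root of the two moment equations
    shape = m1 / scale
    return shape, scale


rng = random.Random(0)
# random.gammavariate(alpha, beta): alpha is the shape, beta is the scale
sample = [rng.gammavariate(3.0, 2.0) for _ in range(50000)]
shape_hat, scale_hat = gamma_mee(sample)
print(shape_hat, scale_hat)
```

The MLE of the gamma shape parameter has no closed form, so estimates of this kind are a natural starting point for an iterative likelihood maximization, in line with reason 2 above.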


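5.8.2 General Theory of Estimating Functions


Both the MLE and the MEE are special cases of a class of estimators called estimating functions estimators. A function $g(\mathbf{X}; \theta)$, $\mathbf{X} \in \mathcal{X}^{(n)}$ and $\theta \in \Theta$, is called an estimating function, if the root $\hat\theta(\mathbf{X})$ of the equation

\[ g(\mathbf{X}, \theta) = 0 \tag{5.8.2} \]

belongs to $\Theta$; i.e., $\hat\theta(\mathbf{X})$ is an estimator of $\theta$. Note that if $\theta$ is a $k$-dimensional vector then (5.8.2) is a system of $k$ independent equations in $\theta$. In other words, $g(\mathbf{X}, \theta)$ is a $k$-dimensional vector function, i.e.,

\[ g(\mathbf{X}, \theta) = (g_1(\mathbf{X}, \theta), \ldots, g_k(\mathbf{X}, \theta))'. \]

$\hat\theta(\mathbf{X})$ is the simultaneous solution of

\[
\begin{aligned}
g_1(\mathbf{X}, \theta) &= 0, \\
g_2(\mathbf{X}, \theta) &= 0, \\
&\ \ \vdots \\
g_k(\mathbf{X}, \theta) &= 0.
\end{aligned}
\tag{5.8.3}
\]

In the MEE case, $g_i(\mathbf{X}, \theta) = M_i(\theta_1, \ldots, \theta_k) - m_i$ ($i = 1, \ldots, k$), where $m_i$ is the $i$th sample moment. In the MLE case, $g_i(\mathbf{X}, \theta) = \frac{\partial}{\partial\theta_i} \log f(\mathbf{X}; \theta)$, $i = 1, \ldots, k$.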


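In both cases, $E_\theta\{g(\mathbf{X}, \theta)\} = 0$ for all $\theta$, under the CR regularity conditions (see Theorem 5.2.2).

An estimating function $g(\mathbf{X}, \theta)$ is called unbiased if $E_\theta\{g(\mathbf{X}; \theta)\} = 0$ for all $\theta$. The information in an estimating function $g(\mathbf{X}, \theta)$ is defined as

\[ I_g(\theta) = \frac{\left(E_\theta\left\{\frac{\partial}{\partial\theta} g(\mathbf{X}, \theta)\right\}\right)^2}{E_\theta\{g^2(\mathbf{X}, \theta)\}}. \tag{5.8.4} \]

For example, if $g(\mathbf{X}, \theta)$ is the score function $S(\mathbf{X}, \theta)$, then under the regularity conditions (3.7.2), $E_\theta\left\{\frac{\partial}{\partial\theta} S(\mathbf{X}; \theta)\right\} = -I(\theta)$ and $E_\theta\{S^2(\mathbf{X}; \theta)\} = I(\theta)$, where $I(\theta)$ is the Fisher information function. A basic result is that $I_g(\theta) \le I(\theta)$ for all unbiased estimating functions.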
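The inequality $I_g(\theta) \le I(\theta)$ can be illustrated with a single observation from an exponential distribution with mean $\theta$, an example chosen here for convenience and not taken from the text. The score is $S(x, \theta) = (x - \theta)/\theta^2$ with $I(\theta) = 1/\theta^2$, while the second-moment estimating function $g(x, \theta) = x^2 - 2\theta^2$ is unbiased and has information $16\theta^2/(20\theta^4) = 0.8/\theta^2$. The sketch below evaluates (5.8.4) by a midpoint rule; the function name `information` and the integration settings are assumptions.

```python
import math


def information(g, dg, pdf, upper, steps=100000):
    """Information (5.8.4) of an estimating function, by midpoint integration on [0, upper]."""
    h = upper / steps
    e_dg = 0.0   # E{ dg/dtheta }
    e_g2 = 0.0   # E{ g^2 }
    for j in range(steps):
        x = (j + 0.5) * h
        w = pdf(x) * h
        e_dg += dg(x) * w
        e_g2 += g(x) ** 2 * w
    return e_dg ** 2 / e_g2


theta = 2.0


def pdf(x):
    return math.exp(-x / theta) / theta          # exponential with mean theta


def score(x):
    return (x - theta) / theta ** 2              # d/dtheta log f(x; theta)


def dscore(x):
    return 1.0 / theta ** 2 - 2.0 * x / theta ** 3


def g2(x):
    return x * x - 2.0 * theta ** 2              # second-moment estimating function


def dg2(x):
    return -4.0 * theta


info_score = information(score, dscore, pdf, upper=60 * theta)
info_g2 = information(g2, dg2, pdf, upper=60 * theta)
print(info_score, 1.0 / theta ** 2)
print(info_g2, 0.8 / theta ** 2)
```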

The CR regularity conditions are now generalized for estimating functions. The 
regularity conditions for estimating functions are as follows: 


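(i) $\frac{\partial g(x, \theta)}{\partial\theta}$ exists for all $\theta$, for almost all $x$ (with probability one).

(ii) $\int g(x, \theta)\, dF(x, \theta)$ is differentiable with respect to $\theta$ under the integral sign, for all $\theta$.

(iii) $E_\theta\left\{\frac{\partial}{\partial\theta} g(X, \theta)\right\} \neq 0$ and exists for all $\theta$.

(iv) $E_\theta\{g^2(X, \theta)\} < \infty$ for all $\theta$.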


Let $T$ be a sufficient statistic for a parametric family $\mathcal{F}$. Bhapkar (1972) proved that, for any unbiased estimating function $g$, if

\[ g^*(T, \theta) = E\{g(\mathbf{X}, \theta) \mid T\} \]

then $I_g(\theta) \le I_{g^*}(\theta)$ for all $\theta$, with equality if and only if $g = g^*$ with probability one. This is a generalization of the Blackwell-Rao Theorem to unbiased estimating functions. Under the regularity conditions, the score function $S(\mathbf{X}, \theta) = \frac{\partial}{\partial\theta} \log f(\mathbf{X}, \theta)$ depends on $\mathbf{X}$ only through the likelihood statistic $T(\mathbf{X})$, which is minimal sufficient. Thus, the score function is most informative among the unbiased estimating functions that satisfy the regularity conditions. If $\theta$ is a vector parameter, then the information in $g$ is

\[ I_g(\theta) = G'(\theta)\, \Sigma_g^{-1}(\theta)\, G(\theta), \tag{5.8.5} \]

where

\[ G(\theta) = \left(E_\theta\left\{\frac{\partial g_i(\mathbf{X}, \theta)}{\partial\theta_j}\right\};\ i, j = 1, \ldots, k\right) \tag{5.8.6} \]

and

\[ \Sigma_g(\theta) = E_\theta\{g(\mathbf{X}, \theta)\, g'(\mathbf{X}, \theta)\}, \tag{5.8.7} \]

where $g(\mathbf{X}, \theta) = (g_1(\mathbf{X}, \theta), \ldots, g_k(\mathbf{X}, \theta))'$ is a vector of $k$ estimating functions, for estimating the $k$ components of $\theta$.

We can show that $I(\theta) - I_g(\theta)$ is a nonnegative definite matrix, where $I(\theta)$ is the Fisher information matrix.

Various applications of the theory of estimating functions can be found in Godambe (1991).


5.9 PRETEST ESTIMATORS 


Pretest estimators (PTEs) are estimators of the parameters, or functions of the parameters of a distribution, which combine testing of some hypothesis(es) and estimation for the purpose of reducing the MSE of the estimator. The idea of preliminary testing has been employed informally in statistical methodology in many different ways and forms. Statistical inference is often based on some model, which entails a certain set of assumptions. If the model is correct, or adequately fits the empirical data, the statistician may approach the problem of estimating the parameters of interest in a certain manner. However, if the model is rejected by the data, the estimation of the parameter of interest may have to follow a different procedure. An estimation procedure that assumes one of two alternative forms, according to the result of a test of some hypothesis, is called a pretest estimation procedure.

PTEs have been studied in various estimation problems, in particular in various least-squares estimation problems for linear models. As we have seen in Section 4.6, if some of the parameters of a linear model can be assumed to be zero (or negligible), the LSE should be modified, according to formula (4.6.14). Accordingly, if $\hat\beta$ denotes the unconstrained LSE of a full-rank model and $\hat\beta^*$ the constrained LSE (4.6.14), the PTE of $\beta$ is

\[ \hat\beta_A = \hat\beta\, I\{\bar A\} + \hat\beta^* I\{A\}, \tag{5.9.1} \]

where $A$ denotes the acceptance set of the hypothesis $H_0 : \beta_{r+1} = \beta_{r+2} = \ldots = \beta_p = 0$, and $\bar A$ the complement of $A$. That is, the constrained LSE is used when $H_0$ is accepted and the unconstrained LSE otherwise. An extensive study of PTEs for linear models, of the form (5.9.1), is presented in the book of Judge and Bock (1978). The reader is referred also to the review paper of Billah and Saleh (1998).
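The pretest idea can be sketched in a much simpler setting than the linear model: estimating the mean $\mu$ of a normal sample with known $\sigma$, after a two-sided test of $H_0 : \mu = 0$. This toy setting, the function names `pretest_mean` and `simulated_mse`, and the chosen level are all assumptions made for illustration. The estimator equals the constrained value 0 on the acceptance set and the sample mean on its complement, mirroring the two-branch form of a PTE.

```python
import math
import random
from statistics import NormalDist, mean


def pretest_mean(xs, sigma=1.0, level=0.05):
    """Return 0 if H0: mu = 0 is accepted by a two-sided z-test, else the sample mean."""
    n = len(xs)
    xbar = mean(xs)
    z_crit = NormalDist().inv_cdf(1.0 - level / 2.0)
    accept = abs(xbar) * math.sqrt(n) / sigma <= z_crit
    return 0.0 if accept else xbar


def simulated_mse(estimator, mu, n=20, reps=2000, seed=0):
    """Monte Carlo MSE of an estimator of mu for N(mu, 1) samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(mu, 1.0) for _ in range(n)]
        total += (estimator(xs) - mu) ** 2
    return total / reps


for mu in (0.0, 0.5, 2.0):
    print(mu, simulated_mse(pretest_mean, mu), simulated_mse(mean, mu))
```

In simulations of this kind the pretest estimator typically has a smaller MSE than the sample mean when $\mu$ is near the hypothesized value and a larger one at intermediate values of $\mu$.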


5.10 ROBUST ESTIMATION OF THE LOCATION AND SCALE 
PARAMETERS OF SYMMETRIC DISTRIBUTIONS 


In this section, we provide some new developments concerning the estimation of the location parameter, $\mu$, and the scale parameter, $\sigma$, in a parametric family, $\mathcal{F}$, whose p.d.f.s are of the form $f(x; \mu, \sigma) = \frac{1}{\sigma} f\left(\frac{x - \mu}{\sigma}\right)$, and $f(-x) = f(x)$ for all $-\infty < x < \infty$. We have seen in various examples before that an estimator of $\mu$, or of $\sigma$, which has small MSE for one family may not be as good for another. We provide below some variance comparisons of the sample mean, $\bar X$, and the sample median, $M_e$, for the following families: normal, mixture of normal and rectangular, $t[\nu]$, Laplace and Cauchy. The mixtures of normal and rectangular distributions will be denoted by $(1 - \alpha)N + \alpha R(-3\sigma, 3\sigma)$. Such a family of mixtures has the standard density function

\[ f(x) = \frac{1 - \alpha}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}x^2\right\} + \frac{\alpha}{6}\, I\{-3 \le x \le 3\}, \quad -\infty < x < \infty. \]

The $t[\nu]$ distributions have a standard p.d.f. as given in (2.13.5). The asymptotic (large sample) variance of the sample median, $M_e$, is given by the formula (7.9.3)

\[ AV\{M_e\} = \frac{\sigma^2}{4 f^2(0)\, n}, \tag{5.10.1} \]

provided $f(0) > 0$, and $f(x)$ is continuous at $x = 0$.
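Formula (5.10.1) needs only the value of the standard density at zero. As a numerical check, the sketch below computes the constant $n \cdot AV\{M_e\}/\sigma^2 = 1/(4f^2(0))$ for the families just listed; the function name `median_avar_constant` and the dictionary of densities are illustrative choices.

```python
import math


def median_avar_constant(f0):
    """n * AV{M_e} / sigma^2 from (5.10.1), given the standard density at 0."""
    return 1.0 / (4.0 * f0 * f0)


phi0 = 1.0 / math.sqrt(2.0 * math.pi)    # standard normal density at 0

density_at_zero = {
    "normal": phi0,
    "0.9N + 0.1R(-3, 3)": 0.9 * phi0 + 0.1 / 6.0,
    "0.5N + 0.5R(-3, 3)": 0.5 * phi0 + 0.5 / 6.0,
    # t[4] density at 0: Gamma(5/2) / (sqrt(4 pi) Gamma(2))
    "t[4]": math.gamma(2.5) / (math.sqrt(4.0 * math.pi) * math.gamma(2.0)),
    "Laplace": 0.5,
    "Cauchy": 1.0 / math.pi,
}

constants = {name: median_avar_constant(f0) for name, f0 in density_at_zero.items()}
for name, value in constants.items():
    print(f"{name:22s} {value:.4f}")
```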

In Table 5.1, we provide the asymptotic variances of $\bar X$ and $M_e$ and their ratio $E = AV\{\bar X\}/AV\{M_e\}$, for the families mentioned above. We see that the sample mean $\bar X$, which is a very good estimator of the location parameter, $\mu$, when $\mathcal{F}$ is the family of normal distributions, loses its efficiency when $\mathcal{F}$ deviates from normality. The reason is that the sample mean is very sensitive to extreme values in the sample. The sample mean performs badly when the sample is drawn from a distribution having heavy tails (relatively high probabilities of large deviations from the median of the distribution). This phenomenon becomes very pronounced in the case of the Cauchy family. One can verify (Fisz, 1963, p. 156) that if $X_1, \ldots, X_n$ are i.i.d. random variables having a common Cauchy distribution then the sample mean $\bar X$ has the same Cauchy distribution, irrespective of the sample size. Furthermore, the Cauchy distribution does not have moments, or we can say that the variance of $\bar X$ is infinite. In order to avoid such possibly severe consequences due to the use of $\bar X$ as an estimator of $\mu$, when the statistician specifies the model erroneously, several types of less sensitive estimators of $\mu$ and $\sigma$ were developed. These estimators are called robust in the sense that their performance is similar, in terms of the sampling variances and other characteristics, over a wide range of families of distributions.


Table 5.1 Asymptotic Variances of $\bar X$ and $M_e$

Family                                   $\bar X$             $M_e$                   $E$
Normal                                   $\sigma^2/n$         $\pi\sigma^2/2n$        0.6366
$0.9N + 0.1R(-3\sigma, 3\sigma)$         $1.2\sigma^2/n$      $1.77\sigma^2/n$        0.6776
$0.5N + 0.5R(-3\sigma, 3\sigma)$         $1.5\sigma^2/n$      $3.1258\sigma^2/n$      0.4799
$t[4]$                                   $2\sigma^2/n$        $16\sigma^2/9n$         1.125
Laplace                                  $2\sigma^2/n$        $\sigma^2/n$            2.000
Cauchy                                   -                    $\pi^2\sigma^2/4n$      $\infty$


We provide now a few such robust estimators of the location parameter:


1. $\alpha$-Trimmed Means: The sample is ordered to obtain $X_{(1)} \le \ldots \le X_{(n)}$. A proportion $\alpha$ of the smallest values and a proportion $\alpha$ of the largest values are removed, and the mean of the remaining $(1 - 2\alpha)n$ values is determined. If $[n\alpha]$ denotes the largest integer not exceeding $n\alpha$ and if $p = 1 + [n\alpha] - n\alpha$ then the $\alpha$-trimmed mean is

\[ \hat\mu_\alpha = \frac{p X_{([n\alpha]+1)} + X_{([n\alpha]+2)} + \cdots + X_{(n-[n\alpha]-1)} + p X_{(n-[n\alpha])}}{n(1 - 2\alpha)}. \tag{5.10.2} \]

The median, $M_e$, is a special case, when $\alpha \to 0.5$.


2. Linear Combinations of Selected Order Statistics: This is a class of estimates which are linear combinations, with some specified weights, of some selected order statistics. Gastwirth (1977) suggested the estimator

\[ LG = 0.3\, X_{([n/3]+1)} + 0.4\, M_e + 0.3\, X_{(n-[n/3])}. \tag{5.10.3} \]

Another such estimator is called the trimean and is given by

\[ TRM = 0.25\, X_{([n/4]+1)} + 0.5\, M_e + 0.25\, X_{(n-[n/4])}. \]


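3. M-Estimates: The MLEs of $\mu$ and $\sigma$ are the simultaneous solutions of the equations

\[ \text{(I)} \quad \sum_{i=1}^{n} \frac{f'\left(\frac{X_i - \mu}{\sigma}\right)}{f\left(\frac{X_i - \mu}{\sigma}\right)} = 0 \]

and

\[ \text{(II)} \quad \sum_{i=1}^{n} \left[\frac{X_i - \mu}{\sigma} \cdot \frac{f'\left(\frac{X_i - \mu}{\sigma}\right)}{f\left(\frac{X_i - \mu}{\sigma}\right)} + 1\right] = 0. \tag{5.10.4} \]

In analogy to the MLE solution and, in order to avoid strong dependence on a particular form of $f(x)$, a general class of M-estimators is defined as the simultaneous solution of

\[ \sum_{i=1}^{n} \psi\left(\frac{X_i - \mu}{\sigma}\right) = 0 \tag{5.10.5} \]

and

\[ \sum_{i=1}^{n} \chi\left(\frac{X_i - \mu}{\sigma}\right) = 0 \]

for suitably chosen $\psi(\cdot)$ and $\chi(\cdot)$ functions. Huber (1964) proposed the M-estimators for which

\[ \psi_k(z) = \begin{cases} -k, & z \le -k, \\ z, & -k < z < k, \\ k, & z \ge k, \end{cases} \tag{5.10.6} \]

and

\[ \chi(z) = \psi_k^2(z) - \beta(k), \tag{5.10.7} \]

where

\[ \beta(k) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \psi_k^2(z)\, e^{-z^2/2}\, dz. \]

The determination of Huber's M-estimators requires numerical iterative solutions. It is customary to start with the initial solution of $\mu = M_e$ and $\sigma = (Q_3 - Q_1)/1.35$, where $Q_3 - Q_1$ is the interquartile range, or $X_{(n-[n/4])} - X_{([n/4]+1)}$. Values of $k$ are usually taken in the interval $[1, 2]$.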
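The iterative nature of the solution can be sketched for the location equation in (5.10.5) with Huber's $\psi_k$ of (5.10.6). The sketch below is a simplification and not Huber's simultaneous procedure: the scale is held fixed at the median absolute deviation divided by 0.6754 instead of being updated through $\chi$, and the fixed-point update $\mu \leftarrow \mu + \sigma \cdot \frac{1}{n}\sum\psi_k((X_i - \mu)/\sigma)$ is one common choice. The names `huber_psi` and `huber_location` are hypothetical.

```python
from statistics import median


def huber_psi(z, k):
    """Huber's psi function (5.10.6): the identity clipped to [-k, k]."""
    return max(-k, min(k, z))


def huber_location(xs, k=1.5, tol=1e-10, max_iter=200):
    """Solve sum psi_k((x_i - mu) / sigma) = 0 for mu, with sigma held fixed.

    Assumes the median absolute deviation of xs is positive.
    """
    mu = median(xs)                                        # customary starting value
    sigma = median([abs(x - mu) for x in xs]) / 0.6754     # fixed robust scale
    for _ in range(max_iter):
        step = sigma * sum(huber_psi((x - mu) / sigma, k) for x in xs) / len(xs)
        mu += step
        if abs(step) < tol:
            break
    return mu


clean = [-2.0, -1.0, 0.0, 1.0, 2.0]
contaminated = clean + [100.0]           # one gross outlier
print(huber_location(clean))
print(huber_location(contaminated), sum(contaminated) / len(contaminated))
```

On the contaminated sample the outlier contributes at most $k$ to the estimating equation, so the estimate stays near the bulk of the data while the sample mean is pulled far away.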

Other M-estimators were introduced by considering a different kind of $\psi(\cdot)$ function. Having estimated the value of $\gamma$ by $\hat\gamma$, use the estimator

\[ \hat\mu(\hat\gamma) = \begin{cases} \text{outer-mean}, & \text{if } \hat\gamma < 2, \\ \bar X, & \text{if } 2 \le \hat\gamma \le 4, \\ \hat\mu_{0.125}, & \text{if } 4 < \hat\gamma \le 4.5, \\ LG, & \text{if } \hat\gamma > 4.5, \end{cases} \]

where the "outer-mean" is the mean of the extreme values in the sample. The reader is referred to the Princeton Study (Andrews et al., 1972) for a comprehensive examination of these and many other robust estimators of the location parameter. Other important articles on the subject are those of Huber (1964, 1967).
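The order-statistic estimators (5.10.2) and (5.10.3) and the trimean translate directly into code. The following is a minimal sketch; the function names are illustrative, `math.floor` plays the role of $[\cdot]$, and 1-based order statistics $X_{(j)}$ become 0-based list positions `x[j - 1]`.

```python
import math
from statistics import median


def trimmed_mean(xs, alpha):
    """alpha-trimmed mean (5.10.2), for 0 <= alpha < 0.5."""
    x = sorted(xs)
    n = len(x)
    m = math.floor(n * alpha)            # [n alpha]
    p = 1 + m - n * alpha                # weight of the two outermost kept values
    kept = x[m:n - m]                    # X_([n alpha]+1), ..., X_(n-[n alpha])
    total = sum(kept) - (1 - p) * (kept[0] + kept[-1])
    return total / (n * (1 - 2 * alpha))


def gastwirth(xs):
    """Gastwirth's estimator (5.10.3)."""
    x = sorted(xs)
    n = len(x)
    m = n // 3                           # [n/3]
    return 0.3 * x[m] + 0.4 * median(x) + 0.3 * x[n - m - 1]


def trimean(xs):
    """Trimean: 0.25, 0.5, 0.25 weights on the quartile order statistics and the median."""
    x = sorted(xs)
    n = len(x)
    m = n // 4                           # [n/4]
    return 0.25 * x[m] + 0.5 * median(x) + 0.25 * x[n - m - 1]


data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(trimmed_mean(data, 0.2), gastwirth(data), trimean(data))
```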

Robust estimators of the scale parameter, $\sigma$, are not as well developed as those of the location parameter. The estimators that are used are

\[ \hat\sigma_1 = (Q_3 - Q_1)/1.35, \]
\[ \hat\sigma_2 = \mathrm{Median}(|X_{(i)} - M_e|,\ i = 1, \ldots, n)/0.6754, \]


\[ \hat\sigma_3 = \sqrt{\frac{\pi}{2}}\, \frac{1}{n} \sum_{i=1}^{n} |X_i - M_e|. \]

Further developments have been recently attained in the area of robust estimation of regression coefficients in multiple regression problems.


PART II: EXAMPLES 


Example 5.1. In the production of concrete, it is required that the proportion of concrete cubes (of specified dimensions) having compressive strength not smaller than $\xi_0$ be at least 0.95. In other words, if $X$ is a random variable representing the compressive strength of a concrete cube, we require that $P\{X \ge \xi_0\} \ge 0.95$. This probability is a numerical characteristic of the distribution of $X$. Let $X_1, \ldots, X_n$ be a sample of i.i.d. random variables representing the compressive strength of $n$ randomly chosen cubes from the production process under consideration. If we do not wish to subject the estimation of $p_0 = P\{X \ge \xi_0\}$ to strong assumptions concerning the distribution of $X$ we can estimate this probability by the proportion of cubes in the sample whose strength is at least $\xi_0$; i.e.,

\[ \hat p = \frac{1}{n} \sum_{i=1}^{n} I\{X_i \ge \xi_0\}. \]

We note that $n\hat p$ has the binomial distribution $B(n, p_0)$. Thus, properties of the estimator $\hat p$ can be deduced from this binomial distribution.

A commonly accepted model for the compressive strength is the family of log-normal distributions. If we are willing to commit the estimation procedure to this model we can obtain estimators of $p_0$ which are more efficient than $\hat p$, provided the model is correct. Let $Y_i = \log X_i$, $i = 1, \ldots, n$, and let $\bar Y_n = \frac{1}{n}\sum_{i} Y_i$, $S_n^2 = \sum_{i} (Y_i - \bar Y_n)^2/(n - 1)$. Let $\eta_0 = \log \xi_0$. Then, an estimator of $p_0$ can be

\[ \tilde p = \Phi\left(\frac{\bar Y_n - \eta_0}{S_n}\right), \]

where $\Phi(u)$ is the standard normal c.d.f. Note that $\bar Y_n$ and $S_n$ are the sample statistics that are substituted to estimate the unknown parameters $(\xi, \sigma)$. Moreover, $(\bar Y_n, S_n)$ is a m.s.s. for the family of log-normal distributions. The estimator we have exhibited depends on the sample values only through the m.s.s. As will be shown later the estimator $\tilde p$ has certain optimal properties in large samples, and even in small samples it is a reasonable estimator to use, provided the statistical model used is adequate for the real phenomenon at hand. ■
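Both estimators of this example are one-liners in code. The sketch below uses made-up strength values and a made-up threshold purely for illustration; the function names `p_hat_nonparametric` and `p_tilde_lognormal` are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def p_hat_nonparametric(strengths, xi0):
    """Proportion of sample values that are at least xi0."""
    return sum(1 for x in strengths if x >= xi0) / len(strengths)


def p_tilde_lognormal(strengths, xi0):
    """Plug-in estimator Phi((Ybar - log xi0) / S) under the log-normal model."""
    y = [math.log(x) for x in strengths]
    return NormalDist().cdf((mean(y) - math.log(xi0)) / stdev(y))   # stdev divides by n - 1


# Hypothetical compressive strengths and threshold (arbitrary units)
strengths = [312.0, 298.0, 345.0, 330.0, 287.0, 360.0, 305.0, 322.0, 341.0, 296.0]
xi0 = 280.0
print(p_hat_nonparametric(strengths, xi0), p_tilde_lognormal(strengths, xi0))
```

With all ten hypothetical values above the threshold, the nonparametric estimate is exactly 1, whereas the model-based estimate stays below 1 because it accounts for the lower tail of the fitted distribution.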


Example 5.2. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular distribution $R(0, \theta)$, $0 < \theta < \infty$. Suppose that the characteristic of interest is the expectation $\mu = \theta/2$. The unbiased estimator $\tilde\mu = \bar X_n$ has a variance

\[ V_\theta\{\bar X_n\} = \frac{\theta^2}{12n}. \]

On the other hand, consider the m.s.s. $X_{(n)} = \max_{1 \le i \le n}\{X_i\}$. The expected value of $X_{(n)}$ is

\[ E_\theta\{X_{(n)}\} = \frac{n}{\theta^n} \int_0^\theta t^n\, dt = \frac{n}{n+1}\,\theta. \]

Hence, the estimator $\hat\mu = \frac{n+1}{2n} X_{(n)}$ is also an unbiased estimator of $\mu$. The variance of $\hat\mu$ is

\[ V_\theta\{\hat\mu\} = \frac{\theta^2}{4n(n+2)}. \]

Thus, $V_\theta\{\hat\mu\} < V_\theta\{\bar X_n\}$ for all $n \ge 2$, and $\hat\mu$ is a better estimator than $\bar X_n$. We note that $\hat\mu$ depends on the m.s.s. $X_{(n)}$, while $\bar X_n$ is not a sufficient statistic. This is the main reason for the superiority of $\hat\mu$ over $\bar X_n$. The theoretical justification is provided in the Rao-Blackwell Theorem. ■


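Example 5.3. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common normal distribution, i.e., $\mathcal{F} = \{N(\xi, \sigma^2); -\infty < \xi < \infty, 0 < \sigma < \infty\}$. Both the mean $\xi$ and the variance $\sigma^2$ are unknown. We wish to estimate unbiasedly the probability $g(\xi, \sigma) = P_{\xi,\sigma}\{X \le \xi_0\}$. Without loss of generality, assume that $\xi_0 = 0$, which implies that $g(\xi, \sigma) = \Phi(-\xi/\sigma) = 1 - \Phi(\xi/\sigma)$. It therefore suffices to estimate $\Phi(\xi/\sigma)$ unbiasedly. Let $\bar X = \frac{1}{n}\sum_{i} X_i$ and $S^2 = \frac{1}{n-1}\sum_{i} (X_i - \bar X)^2$ be the sample mean and variance. $(\bar X, S^2)$ is a complete sufficient statistic. According to the Rao-Blackwell Theorem, there exists an essentially unique unbiased estimator of $\Phi(\xi/\sigma)$ that is a function of the complete sufficient statistic. We prove now that this UMVU estimator is

\[ \hat g(\bar X, S) = \begin{cases} 0, & \text{if } w(\bar X, S) \le 0, \\ I_{w(\bar X, S)}\left(\frac{n}{2} - 1, \frac{n}{2} - 1\right), & \text{if } 0 < w(\bar X, S) < 1, \\ 1, & \text{if } w(\bar X, S) \ge 1, \end{cases} \]

where $I_w(a, b)$ is the incomplete beta function ratio and

\[ w(\bar X, S) = \frac{1}{2}\left[\frac{\bar X\sqrt{n}}{(n-1)S} + 1\right]. \]

The proof is based on the following result (Ellison, 1964). If $U$ and $V$ are independent random variables, $U \sim \beta\left(\frac{\nu-1}{2}, \frac{\nu-1}{2}\right)$ and $V \sim (\chi^2[\nu])^{1/2}$, then $(2U - 1)V \sim N(0, 1)$. Let $\nu = n - 1$ and $V = \sqrt{n-1}\, S/\sigma$. Accordingly,

\[ \hat g(\bar X, S) = P\left\{\beta\left(\tfrac{n}{2} - 1, \tfrac{n}{2} - 1\right) \le w(\bar X, S) \mid \bar X, S\right\}, \]

where $\beta\left(\frac{n}{2} - 1, \frac{n}{2} - 1\right)$ is independent of $(\bar X, S)$. Thus, by substituting in the expression for $w(\bar X, S)$, we obtain

\[
\begin{aligned}
E_{\xi,\sigma}\{\hat g(\bar X, S)\} &= P\left\{\sigma\left(2\beta\left(\tfrac{n}{2} - 1, \tfrac{n}{2} - 1\right) - 1\right)V \le \sqrt{\tfrac{n}{n-1}}\,\bar X\right\} \\
&= P\left\{\sigma N_1(0, 1) - \frac{\sigma}{\sqrt{n-1}} N_2(0, 1) \le \sqrt{\tfrac{n}{n-1}}\,\xi\right\},
\end{aligned}
\]

with $N_1(0, 1)$ and $N_2(0, 1)$ independent standard normal random variables. Since $N_1(0, 1) - N_2(0, 1)/\sqrt{n-1}$ is normal with mean zero and variance $n/(n-1)$,

\[ E_{\xi,\sigma}\{\hat g(\bar X, S)\} = P\left\{N(0, 1) \le \frac{\xi}{\sigma}\right\} = \Phi(\xi/\sigma), \quad \text{for all } (\xi, \sigma). \]

We provide an additional example that illustrates the Rao-Blackwellization method. ■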
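The unbiasedness claimed in this example can be checked by simulation. The sketch below assumes the estimator has the form of an incomplete beta function ratio $I_w\left(\frac{n}{2} - 1, \frac{n}{2} - 1\right)$ evaluated at $w = \frac{1}{2}\left[\bar X\sqrt{n}/((n-1)S) + 1\right]$, clipped to $[0, 1]$, as the target of $\Phi(\xi/\sigma)$. The incomplete beta ratio is approximated by a crude midpoint rule (`reg_inc_beta` is a hypothetical helper, not a library routine), and the parameter values are arbitrary. For $n = 4$ the beta distribution is uniform, so the estimator is simply $w$ clipped to $[0, 1]$.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def reg_inc_beta(w, a, b, steps=200):
    """Incomplete beta function ratio I_w(a, b), approximated by a midpoint rule."""
    if w <= 0.0:
        return 0.0
    if w >= 1.0:
        return 1.0
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = w / steps
    total = 0.0
    for j in range(steps):
        t = (j + 0.5) * h
        total += math.exp(log_c + (a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t))
    return total * h


def umvu_phi(xs):
    """Estimator of Phi(xi / sigma) based on the sample mean and S (requires n >= 3)."""
    n = len(xs)
    xbar, s = mean(xs), stdev(xs)        # stdev uses the divisor n - 1
    w = 0.5 * (xbar * math.sqrt(n) / ((n - 1) * s) + 1.0)
    return reg_inc_beta(w, n / 2 - 1, n / 2 - 1)


xi, sigma, n, reps = 0.5, 2.0, 4, 5000
rng = random.Random(1)
average = mean(umvu_phi([rng.gauss(xi, sigma) for _ in range(n)]) for _ in range(reps))
target = NormalDist().cdf(xi / sigma)
print(average, target)
```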


Example 5.4. Let $X_1, \ldots, X_n$ be i.i.d. random variables, having a common Poisson distribution, $P(\lambda)$, $0 < \lambda < \infty$. We wish to estimate unbiasedly the Poisson probability $p(k; \lambda) = e^{-\lambda}\lambda^k/k!$. An unbiased estimator of $p(k; \lambda)$ based on one observation is

\[ \tilde p(k; X_1) = I\{X_1 = k\}, \quad k = 0, 1, \ldots. \]

Obviously, this estimator is inefficient. According to the Rao-Blackwell Theorem the MVUE of $p(k; \lambda)$ is

\[ \hat p(k; T_n) = E\{I\{X_1 = k\} \mid T_n\} = P[X_1 = k \mid T_n], \]

where $T_n = \sum_{i=1}^{n} X_i$ is the complete sufficient statistic. If $T_n > 0$ the conditional distribution of $X_1$, given $T_n$, is the binomial $B\left(T_n, \frac{1}{n}\right)$. Accordingly, the MVUE of $p(k; \lambda)$ is

\[ \hat p(k; T_n) = \begin{cases} I\{k = 0\}, & \text{if } T_n = 0, \\ b\left(k \mid T_n, \frac{1}{n}\right), & \text{if } T_n > 0, \end{cases} \]

where $b\left(k \mid T_n, \frac{1}{n}\right)$ is the p.d.f. of the binomial distribution $B\left(T_n, \frac{1}{n}\right)$. ■
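The estimator of this example and its unbiasedness can be verified directly, since the expectation over $T_n \sim P(n\lambda)$ is a rapidly converging sum. The sketch below is illustrative: the function names `poisson_mvue` and `expected_mvue` and the truncation point `t_max` are assumptions.

```python
import math


def poisson_mvue(k, t, n):
    """b(k | t, 1/n); this reduces to I{k = 0} when t = 0."""
    if k > t:
        return 0.0
    return math.comb(t, k) * (1.0 / n) ** k * (1.0 - 1.0 / n) ** (t - k)


def expected_mvue(k, lam, n, t_max=300):
    """E{ p_hat(k; T_n) } for T_n ~ Poisson(n * lam), by a truncated sum."""
    mean_t = n * lam
    pmf = math.exp(-mean_t)              # P{T_n = 0}
    total = 0.0
    for t in range(t_max + 1):
        total += pmf * poisson_mvue(k, t, n)
        pmf *= mean_t / (t + 1)          # P{T_n = t + 1} from P{T_n = t}
    return total


k, lam, n = 2, 1.5, 8
print(expected_mvue(k, lam, n), math.exp(-lam) * lam ** k / math.factorial(k))
```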


Example 5.5. We have seen in Section 3.6 that if the m.s.s. $S(\mathbf{X})$ is incomplete, there is reason to find an ancillary statistic $A(\mathbf{X})$ and base the inference on the conditional distribution of $S(\mathbf{X})$, given $A(\mathbf{X})$. We illustrate in the following example a case where such an analysis does not improve the estimator.

Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular distribution in

\[ \mathcal{F} = \{R(\theta - 1, \theta + 1); -\infty < \theta < \infty\}. \]

A likelihood function for $\theta$ is

\[ L(\theta; \mathbf{X}) = \frac{1}{2^n}\, I\{X_{(n)} - 1 < \theta < X_{(1)} + 1\}, \]

where $X_{(1)} \le \cdots \le X_{(n)}$ is the order statistic. A m.s.s. is $(X_{(1)}, X_{(n)})$. This statistic, however, is incomplete. Indeed, $E_\theta\left\{X_{(n)} - X_{(1)} - 2\frac{n-1}{n+1}\right\} = 0$, but $P_\theta\left\{X_{(n)} - X_{(1)} = 2\frac{n-1}{n+1}\right\} = 0$ for each $\theta$.

Writing $R(\theta - 1, \theta + 1) \sim \theta - 1 + 2R(0, 1)$ we have $X_{(1)} \sim \theta - 1 + 2U_{(1)}$ and $X_{(n)} \sim \theta - 1 + 2U_{(n)}$, where $U_{(1)}$ and $U_{(n)}$ are the order statistics from $R(0, 1)$. Moreover, $E\{U_{(1)}\} = \frac{1}{n+1}$ and $E\{U_{(n)}\} = \frac{n}{n+1}$. It follows immediately that $\hat\theta = \frac{1}{2}(X_{(1)} + X_{(n)})$ is unbiased. By the Blackwell-Rao Theorem it cannot be improved by conditioning on the sufficient statistic.

We develop now the conditional distribution of $\hat\theta$, given the ancillary statistic $W = X_{(n)} - X_{(1)}$. The p.d.f. of $W$ is

\[ f_W(r) = \frac{n(n-1)}{2^n}\, r^{n-2}(2 - r), \quad 0 \le r \le 2. \]

The transformation $(X_{(1)}, X_{(n)}) \to (\hat\theta, W)$ is one to one. The joint p.d.f. of $(\hat\theta, W)$ is

\[ f_{\hat\theta, W}(t, r; \theta) = \frac{n(n-1)}{2^n}\, r^{n-2}\, I\left\{\theta - 1 + \frac{r}{2} \le t \le \theta + 1 - \frac{r}{2}\right\}. \]

Accordingly,

\[ p_{\hat\theta \mid W}(t \mid r) = \frac{1}{2 - r}\, I\left\{\theta - 1 + \frac{r}{2} \le t \le \theta + 1 - \frac{r}{2}\right\}. \]

That is, $\hat\theta \mid W \sim R\left(\theta - 1 + \frac{W}{2}, \theta + 1 - \frac{W}{2}\right)$. Thus,

\[ E\{\hat\theta \mid W\} = \theta, \quad \text{for all } -\infty < \theta < \infty, \]

and

\[ V\{\hat\theta \mid W\} = \frac{(2 - W)^2}{12}. \]

We have seen already that $\hat\theta$ is an unbiased estimator. From the law of total variance, we get

\[ V_\theta\{\hat\theta\} = \frac{2}{(n+1)(n+2)} \]

for all $-\infty < \theta < \infty$. Thus, the variance of $\hat\theta$ was obtained from this conditional analysis. One can obtain the same result by computing $V\{U_{(1)} + U_{(n)}\}$. ■
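The variance of the midrange $\frac{1}{2}(X_{(1)} + X_{(n)})$ obtained in this example, $2/((n+1)(n+2))$, is easy to confirm by simulation. The following is a minimal Monte Carlo sketch; the function names, the seed, the sample size, and the number of replications are arbitrary choices.

```python
import random
from statistics import mean, variance


def midrange(xs):
    """The estimator (X_(1) + X_(n)) / 2."""
    return 0.5 * (min(xs) + max(xs))


def simulate_midrange(theta, n, reps, seed=0):
    """Draw `reps` midranges of R(theta - 1, theta + 1) samples of size n."""
    rng = random.Random(seed)
    return [midrange([rng.uniform(theta - 1.0, theta + 1.0) for _ in range(n)])
            for _ in range(reps)]


theta, n = 1.5, 5
draws = simulate_midrange(theta, n, reps=40000)
sim_mean, sim_var = mean(draws), variance(draws)
theory_var = 2.0 / ((n + 1) * (n + 2))
print(sim_mean, theta)
print(sim_var, theory_var)
```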


Example 5.6. Consider the MVUE of the Poisson probabilities $p(k; \lambda)$, derived in Example 5.4. We derive here the Cramér-Rao lower bound for the variance of this estimator. We first note that the Fisher information for a sample of $n$ i.i.d. Poisson random variables is $I_n(\lambda) = n/\lambda$. Furthermore, differentiating $p(k; \lambda)$ with respect to $\lambda$ we obtain that $\frac{\partial}{\partial\lambda} p(k; \lambda) = -(p(k; \lambda) - p(k - 1; \lambda))$, where $p(-1; \lambda) \equiv 0$. If $\hat p(k; T_n)$ is the MVUE of $p(k; \lambda)$, then according to the Cramér-Rao inequality

\[ \mathrm{Var}_\lambda\{\hat p(k; T_n)\} \ge \begin{cases} \dfrac{\lambda}{n}\, p^2(k - 1; \lambda)\left(\dfrac{\lambda}{k} - 1\right)^2, & k \ge 1, \\[2mm] \dfrac{\lambda}{n}\, e^{-2\lambda}, & k = 0. \end{cases} \]

Both cases equal $\frac{\lambda}{n}(p(k - 1; \lambda) - p(k; \lambda))^2$. Strict inequality holds for all values of $\lambda$, $0 < \lambda < \infty$, since the distribution of $\hat p(k; T_n)$ is not of the exponential type, although the distribution of $T_n$ is Poisson. The Poisson family satisfies all the conditions of Joshi (1976) and therefore, since the distribution of $\hat p(k; T_n)$ is not of the exponential type, the inequality is strict. Note that $V\{\hat p(k; T_n)\} = E\left\{b^2\left(k \mid T_n, \frac{1}{n}\right)\right\} - p^2(k; \lambda)$. We can compute this variance numerically. ■
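The numerical computation mentioned at the end of this example can be sketched as follows, assuming (as in Example 5.4) that the MVUE is the binomial probability $b(k \mid T_n, 1/n)$ with $T_n \sim P(n\lambda)$, and that the Cramér-Rao bound equals $\frac{\lambda}{n}(p(k-1; \lambda) - p(k; \lambda))^2$. The function names and the truncation point `t_max` are illustrative.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k) if k >= 0 else 0.0


def mvue_variance(k, lam, n, t_max=400):
    """Var of b(k | T_n, 1/n) with T_n ~ Poisson(n * lam), by a truncated sum."""
    mean_t = n * lam
    pmf = math.exp(-mean_t)              # P{T_n = 0}
    second_moment = 0.0
    for t in range(t_max + 1):
        b = math.comb(t, k) * (1.0 / n) ** k * (1.0 - 1.0 / n) ** (t - k) if t >= k else 0.0
        second_moment += pmf * b * b
        pmf *= mean_t / (t + 1)
    return second_moment - poisson_pmf(k, lam) ** 2


def cramer_rao_bound(k, lam, n):
    """(lam / n) * (p(k - 1; lam) - p(k; lam))^2."""
    return lam / n * (poisson_pmf(k - 1, lam) - poisson_pmf(k, lam)) ** 2


lam, n = 2.0, 10
for k in range(4):
    print(k, mvue_variance(k, lam, n), cramer_rao_bound(k, lam, n))
```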


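Example 5.7. Consider again the estimation problem of Examples 5.4 and 5.6, with $k = 0$. The MVUE of $\omega(\lambda) = e^{-\lambda}$ is $\hat\omega(T_n) = \left(1 - \frac{1}{n}\right)^{T_n}$. The variance of $\hat\omega(T_n)$ can be obtained by considering the probability generating function of $T_n \sim P(n\lambda)$ at $t = \left(1 - \frac{1}{n}\right)^2$. We thus obtain

\[ \mathrm{Var}_\lambda\{\hat\omega(T_n)\} = e^{-2\lambda}(e^{\lambda/n} - 1). \]

Since $\omega(\lambda)$ is an analytic function, we can bound the variance of $\hat\omega(T_n)$ from below by using the BLB of order $k = 2$ (see (5.2.15)). We obtain $V_{11} = \frac{n}{\lambda}$, $V_{12} = 0$, $V_{22} = \frac{2n^2}{\lambda^2}$. Hence, the lower bound for $k = 2$ is

\[ L_2(\lambda) = \frac{\lambda}{n}\, e^{-2\lambda}\left(1 + \frac{\lambda}{2n}\right), \quad 0 < \lambda < \infty. \]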
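The three quantities of this example, the exact variance $e^{-2\lambda}(e^{\lambda/n} - 1)$, the Cramér-Rao bound $\frac{\lambda}{n}e^{-2\lambda}$, and the second-order bound $L_2(\lambda)$, can be tabulated side by side. The sketch below uses illustrative function names and arbitrary parameter values.

```python
import math


def exact_variance(lam, n):
    """Var of (1 - 1/n)^T_n for T_n ~ Poisson(n * lam)."""
    return math.exp(-2.0 * lam) * (math.exp(lam / n) - 1.0)


def cramer_rao_bound(lam, n):
    return lam / n * math.exp(-2.0 * lam)


def bhattacharyya_bound_2(lam, n):
    """Second-order lower bound L_2(lambda)."""
    return lam / n * math.exp(-2.0 * lam) * (1.0 + lam / (2.0 * n))


for lam, n in [(0.5, 5), (2.0, 10), (5.0, 20)]:
    print(lam, n, cramer_rao_bound(lam, n), bhattacharyya_bound_2(lam, n), exact_variance(lam, n))
```

The ordering of the three columns follows from $e^x - 1 > x + x^2/2 > x$ for $x = \lambda/n > 0$.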


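The lower bound $L_2(\lambda)$ is larger than the Cramér-Rao lower bound $\frac{\lambda}{n}e^{-2\lambda}$ for all $0 < \lambda < \infty$. ■


Example 5.8. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. vectors having a common bivariate normal distribution $N\left(\mathbf{0}, \sigma^2\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)$, $-1 < \rho < 1$, $0 < \sigma^2 < \infty$. The complete sufficient statistic for this family of bivariate normal distributions is $T_1(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{n}(X_i^2 + Y_i^2)$ and $T_2(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{n} X_i Y_i$. We wish to estimate the coefficient of correlation $\rho$.

An unbiased estimator of $\rho$ is given by $\hat\rho = \sum_{i=1}^{n} X_i Y_i \Big/ \sum_{i=1}^{n} X_i^2$. Indeed,

\[ E\{\hat\rho \mid \mathbf{X}\} = \frac{1}{\sum_{i=1}^{n} X_i^2} \sum_{i=1}^{n} X_i\, E\{Y_i \mid \mathbf{X}\}. \]

But $E\{Y_i \mid \mathbf{X}\} = \rho X_i$ for all $i = 1, \ldots, n$. Hence, $E\{\hat\rho \mid \mathbf{X}\} = \rho$ w.p.1. The unbiased estimator is, however, not an MVUE. Indeed, $\hat\rho$ is not a function of $(T_1(\mathbf{X}, \mathbf{Y}), T_2(\mathbf{X}, \mathbf{Y}))$. The MVUE can be obtained, according to the Rao-Blackwell Theorem, by determining the conditional expectation $E\{\hat\rho \mid T_1, T_2\}$.

The variance of $\hat\rho$ is

\[ V\{\hat\rho\} = \frac{1 - \rho^2}{n - 2}, \quad n > 2, \]

since $V\{\hat\rho \mid \mathbf{X}\} = \sigma^2(1 - \rho^2)\big/\sum_{i} X_i^2$ and $\sum_{i} X_i^2/\sigma^2 \sim \chi^2[n]$. The Fisher information matrix in the present case is

\[ I(\sigma^2, \rho) = n \begin{pmatrix} \dfrac{1}{\sigma^4} & \dfrac{-\rho}{\sigma^2(1 - \rho^2)} \\[3mm] \dfrac{-\rho}{\sigma^2(1 - \rho^2)} & \dfrac{1 + \rho^2}{(1 - \rho^2)^2} \end{pmatrix}. \]

The inverse of the Fisher information matrix is

\[ (I(\sigma^2, \rho))^{-1} = \frac{1}{n} \begin{pmatrix} \sigma^4(1 + \rho^2) & \sigma^2\rho(1 - \rho^2) \\ \sigma^2\rho(1 - \rho^2) & (1 - \rho^2)^2 \end{pmatrix}. \]

The lower bound on the variances of unbiased estimators of $\rho$ is, therefore, $(1 - \rho^2)^2/n$. The ratio of the lower bound of the variance of $\hat\rho$ to the actual variance is $\frac{(1 - \rho^2)(n - 2)}{n} \approx 1 - \rho^2$ for large $n$. Thus, $\hat\rho$ is a good unbiased estimator only if $\rho^2$ is close to zero. ■


Example 5.9.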


A. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common location-parameter exponential distribution with a p.d.f.

\[ f(x; \theta) = I\{x \ge \theta\} \exp\{-(x - \theta)\}, \quad -\infty < \theta < \infty. \]

The sample minimum $X_{(1)}$ is a complete sufficient statistic. $X_{(1)}$ is distributed like $\theta + G(n, 1)$. Hence, $E\{X_{(1)}\} = \theta + \frac{1}{n}$ and the MVUE of $\theta$ is $\hat\theta(X_{(1)}) = X_{(1)} - \frac{1}{n}$. The variance of this estimator is

\[ \mathrm{Var}_\theta\{\hat\theta(X_{(1)})\} = \frac{1}{n^2}, \quad \text{for all } -\infty < \theta < \infty. \]

In the present case, the Fisher information $I(\theta)$ does not exist. We derive now the modified Chapman-Robbins lower bound for the variance of an unbiased estimator of $\theta$. Notice first that $W_\phi(X_{(1)}; \theta) = I\{X_{(1)} \ge \phi\}\, e^{n(\phi - \theta)}$, where $T = X_{(1)}$, for all $\phi \ge \theta$. It is then easy to prove that

\[ A(\theta, \phi) = \exp\{n(\phi - \theta)\} - 1, \quad \phi > \theta. \]

Accordingly,

\[ \mathrm{Var}_\theta\{\hat\theta(X_{(1)})\} \ge \sup_{\phi > \theta} \frac{(\phi - \theta)^2}{\exp\{n(\phi - \theta)\} - 1}. \]

The function $x^2/(e^{nx} - 1)$ assumes a unique maximum over $(0, \infty)$ at the root of the equation $e^{nx}(2 - nx) = 2$. This root is approximately $x_0 = \frac{1.594}{n}$. This approximation yields

\[ \mathrm{Var}_\theta\{\hat\theta(X_{(1)})\} \ge \frac{0.6476}{n^2}. \]

B. Consider the case of a random sample from $R(0, \theta)$, $0 < \theta < \infty$. As shown in Example 3.11 A, $I_n(\theta) = \frac{n^2}{\theta^2}$. The UMVU estimator of $\theta$ is $\hat\theta_n = \frac{n+1}{n} X_{(n)}$. The variance of $\hat\theta_n$ is $V_\theta\{\hat\theta_n\} = \frac{\theta^2}{n(n+2)}$. Thus, in this nonregular case

\[ V_\theta\{\hat\theta_n\} < \frac{1}{I_n(\theta)} \quad \text{for all } 0 < \theta < \infty. \]

However,

\[ \lim_{n \to \infty} V_\theta\{\hat\theta_n\}\, I_n(\theta) = 1 \quad \text{for all } \theta. \]
■
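For part A, the maximization behind the Chapman-Robbins bound can be checked numerically. Writing $y = nx$, the maximizer of $x^2/(e^{nx} - 1)$ solves $e^{y}(2 - y) = 2$ on $(0, \infty)$, and the bound is $c/n^2$ with $c = y_0^2/(e^{y_0} - 1)$. The bisection below is a minimal sketch; the function name `chapman_robbins_constant` is hypothetical.

```python
import math


def chapman_robbins_constant(tol=1e-12):
    """Return (y0, c): the root of e^y (2 - y) = 2 in (1, 2) and c = y0^2 / (e^y0 - 1)."""
    lo, hi = 1.0, 2.0        # h(1) = e - 2 > 0 and h(2) = -2 < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.exp(mid) * (2.0 - mid) - 2.0 > 0.0:
            lo = mid
        else:
            hi = mid
    y0 = 0.5 * (lo + hi)
    return y0, y0 * y0 / (math.exp(y0) - 1.0)


y0, c = chapman_robbins_constant()
n = 10
print(y0, c)
print("x0 =", y0 / n, "bound =", c / n ** 2, "actual variance =", 1.0 / n ** 2)
```

The bound $c/n^2$ is about 65% of the actual variance $1/n^2$ of the MVUE, for every $n$.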
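Example 5.10. Let $X_1, \ldots, X_n$ be i.i.d. random variables having the normal distribution $N(\theta, \sigma^2)$ and $Y_1, \ldots, Y_n$ i.i.d. random variables having the normal distribution $N(\gamma\theta^2, \sigma^2)$, where $-\infty < \theta, \gamma < \infty$, and $0 < \sigma < \infty$. The vector $\mathbf{X} = (X_1, \ldots, X_n)'$ is independent of $\mathbf{Y} = (Y_1, \ldots, Y_n)'$. A m.s.s. is $(\bar X_n, \bar Y_n, Q_n)$, where $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, $\bar Y_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$, and $Q_n = \sum_{i=1}^{n}(X_i - \bar X_n)^2 + \sum_{i=1}^{n}(Y_i - \bar Y_n)^2$. The Fisher information matrix can be obtained from the likelihood function

\[ L(\theta, \gamma, \sigma^2 \mid \bar X_n, \bar Y_n, Q_n) = \frac{1}{(\sigma^2)^n} \exp\left\{-\frac{n}{2\sigma^2}\left[(\bar X_n - \theta)^2 + (\bar Y_n - \gamma\theta^2)^2 + \frac{Q_n}{n}\right]\right\}. \]

The covariance matrix of the score functions is

\[ nI(\theta, \gamma, \sigma^2) = \frac{n}{\sigma^2} \begin{pmatrix} 1 + 4\gamma^2\theta^2 & 2\gamma\theta^3 & 0 \\ 2\gamma\theta^3 & \theta^4 & 0 \\ 0 & 0 & \dfrac{1}{\sigma^2} \end{pmatrix}. \]

Thus,

\[ I_n(\theta, \gamma, \sigma^2) = |nI(\theta, \gamma, \sigma^2)| = \frac{n^3\theta^4}{\sigma^8}. \]

Consider the reparametrization $g_1(\theta, \gamma, \sigma) = \theta$, $g_2(\theta, \gamma, \sigma) = \gamma\theta^2$ and $g_3(\theta, \gamma, \sigma) = \sigma^2$. The UMVU estimator is $\hat{\mathbf{g}} = (\bar X_n, \bar Y_n, Q_n/2(n-1))'$. The variance covariance matrix of $\hat{\mathbf{g}}$ is

\[ \Sigma(\hat{\mathbf{g}}) = \frac{\sigma^2}{n} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \dfrac{n\sigma^2}{n-1} \end{pmatrix} \]

and

\[ D(\theta) = \begin{pmatrix} 1 & 0 & 0 \\ 2\gamma\theta & \theta^2 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]

Since $|D(\theta)|^2 = \theta^4$, we have $|D(\theta)(nI)^{-1}D'(\theta)| = \sigma^8/n^3$ and $|\Sigma(\hat{\mathbf{g}})| = \sigma^8/(n^2(n-1))$. The efficiency coefficient is

\[ \mathcal{E}(\hat{\mathbf{g}}) = \frac{|D(\theta)(nI)^{-1}D'(\theta)|}{|\Sigma(\hat{\mathbf{g}})|} = \frac{\sigma^8/n^3}{\sigma^8/(n^2(n-1))} = 1 - \frac{1}{n}. \]
■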


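Example 5.11. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample of $n$ i.i.d. vectors having a joint bivariate normal distribution

$$N\left(\begin{pmatrix} \mu \\ \mu \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma\tau \\ \rho\sigma\tau & \tau^2 \end{pmatrix}\right),$$

where $-\infty < \mu < \infty$, $0 < \tau < \infty$, $0 < \sigma < \infty$, and $-1 < \rho < 1$. Assume that $\sigma^2$, $\tau^2$, and $\rho$ are known. The problem is to estimate the common mean $\mu$. We develop the formula of the BLUE of $\mu$. In the present case, the covariance matrix of the vector of sample means $(\bar X_n, \bar Y_n)'$ is

$$\Sigma = \frac{1}{n}\begin{pmatrix} \sigma^2 & \rho\sigma\tau \\ \rho\sigma\tau & \tau^2 \end{pmatrix}$$

and

$$\Sigma^{-1} = \frac{n}{\sigma^2\tau^2(1 - \rho^2)}\begin{pmatrix} \tau^2 & -\rho\sigma\tau \\ -\rho\sigma\tau & \sigma^2 \end{pmatrix}.$$

The BLUE of the common mean is, according to (5.3.3),

$$\hat\mu = \frac{\omega\bar X_n + \bar Y_n}{\omega + 1},$$

where $\bar X_n$ and $\bar Y_n$ are the sample means and

$$\omega = \frac{\tau^2 - \rho\sigma\tau}{\sigma^2 - \rho\sigma\tau}, \quad \text{provided } \rho\tau \neq \sigma.$$

Since $\Sigma$ is known, $\hat\mu$ is a UMVU estimator.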
Example 5.12. Let $X_1, \ldots, X_n$ be i.i.d. Weibull variables, i.e., $X \sim G^{1/\beta}(\lambda, 1)$, where $0 < \lambda, \beta < \infty$. Both $\lambda$ and $\beta$ are unknown. The m.s.s. is $(X_{(1)}, \ldots, X_{(n)})$. Let $Y_i = \log X_i$, $i = 1, \ldots, n$, and $Y_{(i)} = \log X_{(i)}$. Obviously, $Y_{(1)} \leq Y_{(2)} \leq \cdots \leq Y_{(n)}$. We obtain the linear model

$$Y_{(i)} = \mu + \sigma\log G_{(i)}, \quad i = 1, \ldots, n,$$

where $\mu = -\frac{1}{\beta}\log\lambda$ and $\sigma = \frac{1}{\beta}$; $G_{(i)}$ is the $i$th order statistic of $n$ i.i.d. variables distributed like $G(1, 1)$. BLUEs of $\mu$ and $\sigma$ are given by (5.4.16), where $\alpha$ is the vector of $E\{\log G_{(i)}\}$ and $V$ is the covariance matrix of $\log G_{(i)}$.

The p.d.f. of $G_{(i)}$ is

$$f_{(i)}(x) = \frac{n!}{(i-1)!(n-i)!}(1 - e^{-x})^{i-1}e^{-x(n-i+1)}, \quad 0 < x < \infty.$$

Hence,

$$\alpha_i = E\{\log G_{(i)}\} = -\frac{n!}{(i-1)!(n-i)!}\sum_{j=0}^{i-1}(-1)^j\binom{i-1}{j}\int_{-\infty}^{\infty} u\,e^{-u - (n-i+1+j)e^{-u}}\,du.$$

The integral on the RHS is proportional to the expected value of the extreme value distribution. Thus,

$$\alpha_i = -\frac{n!}{(n-i)!}\sum_{j=0}^{i-1}\frac{(-1)^j}{j!(i-1-j)!}\cdot\frac{\gamma + \log(n-i+1+j)}{n-i+1+j},$$

where $\gamma = 0.577216\ldots$ is the Euler constant. The values of $\alpha_i$ can be determined numerically for any $n$ and $i = 1, \ldots, n$. Similar calculations yield formulae for the elements of the covariance matrix $V$. The point is that, from the obtained formulae of $\alpha_i$ and $V_{ij}$, we can determine the estimates only numerically. Moreover, the matrix $V$ is of order $n \times n$. Thus, if the sample involves a few hundred observations, the numerical inversion of $V$ becomes difficult, if at all possible.
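The finite-sum expression for $\alpha_i = E\{\log G_{(i)}\}$ can be evaluated directly for small $n$. The sketch below is an illustration only; the function name `alpha` is chosen here, and the alternating sum loses accuracy in floating point for large $n$, so it should not be relied on beyond small samples.

```python
import math

EULER_GAMMA = 0.5772156649015329

def alpha(i, n):
    """E{log G_(i)} for the i-th order statistic of n i.i.d. G(1,1) variables."""
    total = 0.0
    for j in range(i):
        c = n - i + 1 + j
        coef = (-1) ** j / (math.factorial(j) * math.factorial(i - 1 - j))
        total += coef * (EULER_GAMMA + math.log(c)) / c
    return -math.factorial(n) / math.factorial(n - i) * total

alphas = [alpha(i, 5) for i in range(1, 6)]
print(alphas)
```

Two quick sanity checks follow from the theory: $\alpha_1 = -\gamma - \log n$ (the minimum of $n$ standard exponentials is exponential with rate $n$), and $\sum_i \alpha_i = -n\gamma$.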


Example 5.13. Consider the multiple regression problem with $p = 3$, $\sigma^2 = 1$, for which the normal equations are

$$\begin{pmatrix} 1.07 & 0.27 & 0.66 \\ 0.27 & 1.07 & 0.66 \\ 0.66 & 0.66 & 0.68 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 1.05 \\ -0.06 \\ 0.83 \end{pmatrix}.$$

By employing the orthogonal (Helmert) transformation

$$H = \begin{pmatrix} \frac{1}{\sqrt 3} & \frac{1}{\sqrt 3} & \frac{1}{\sqrt 3} \\ \frac{1}{\sqrt 2} & -\frac{1}{\sqrt 2} & 0 \\ \frac{1}{\sqrt 6} & \frac{1}{\sqrt 6} & -\frac{2}{\sqrt 6} \end{pmatrix},$$

we obtain that

$$H(A'A)H' = \begin{pmatrix} 2.0 & 0 & 0 \\ 0 & 0.8 & 0 \\ 0 & 0 & 0.02 \end{pmatrix}.$$

That is, the eigenvalues of $A'A$ are $\lambda_1 = 2$, $\lambda_2 = 0.8$ and $\lambda_3 = 0.02$. The LSEs of $\beta$ are $\hat\beta_1 = -4.58625$, $\hat\beta_2 = -5.97375$, and $\hat\beta_3 = 11.47$. The variance covariance matrix of the LSE is

$$\Sigma(\hat\beta) = (A'A)^{-1} = \begin{pmatrix} 9.125 & 7.875 & -16.5 \\ 7.875 & 9.125 & -16.5 \\ -16.5 & -16.5 & 33.5 \end{pmatrix},$$

having a trace $E\{L^2(0)\} = 51.75 = \sum_{i=1}^3 \lambda_i^{-1}$. In order to illustrate numerically the effect of the ridge regression, assume that the true value of $\beta$ is $(1.5, -6.5, 0.5)$. Let $\gamma = H\beta$. The numerical value of $\gamma$ is $(-2.59809, 5.65685, -2.44949)$. According to (5.4.4), we can write the sum of the MSEs of the components of $\hat\beta(k)$ as

$$E\{L^2(k)\} = \sum_{i=1}^3 \frac{\lambda_i}{(\lambda_i + k)^2} + k^2\sum_{i=1}^3 \frac{\gamma_i^2}{(\lambda_i + k)^2}.$$

The estimate of $k^0$ is $\hat k = 0.249$. In the following table, we provide some numerical results.

  $k$                   0           0.05        0.075       0.10        0.125       0.15
  $\hat\beta_1(k)$     -4.58625    -0.64636    -0.24878    -0.02500     0.11538     0.20952
  $\hat\beta_2(k)$     -5.97375    -1.95224    -1.51735    -1.25833    -1.08462    -0.95890
  $\hat\beta_3(k)$     11.47000     3.48641     2.64325     2.15000     1.82572     1.59589
  $E\{L^2(k)\}$        51.75000     8.84077     7.70901     7.40709     7.39584     7.51305

We see that $E\{L^2(k)\}$ is minimized for $k^0$ around 0.125. At this value of $k$, $\hat\beta(k)$ is substantially different from the LSE $\hat\beta(0)$.
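The entries of the table can be recomputed from the quantities given in the example. The following is a sketch under the assumption that the ridge estimator is $\hat\beta(k) = (A'A + kI)^{-1}A'Y$, evaluated in the eigenbasis supplied by the Helmert matrix; the function names are illustrative only.

```python
import math

# Rows of the Helmert matrix H and the eigenvalues of A'A, as given in the example.
H = [[1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],
     [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],
     [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)]]
LAM = [2.0, 0.8, 0.02]
RHS = [1.05, -0.06, 0.83]        # right-hand side of the normal equations
BETA_TRUE = [1.5, -6.5, 0.5]     # true value assumed in the example

def ridge_estimate(k):
    """beta(k) = (A'A + kI)^{-1} A'Y, computed in the eigenbasis of A'A."""
    c = [sum(H[i][j] * RHS[j] for j in range(3)) for i in range(3)]
    g = [c[i] / (LAM[i] + k) for i in range(3)]
    return [sum(H[i][j] * g[i] for i in range(3)) for j in range(3)]

def total_mse(k):
    """E{L^2(k)}: sum of variances plus sum of squared biases."""
    gam = [sum(H[i][j] * BETA_TRUE[j] for j in range(3)) for i in range(3)]
    var = sum(l / (l + k) ** 2 for l in LAM)
    bias2 = k ** 2 * sum(g ** 2 / (l + k) ** 2 for g, l in zip(gam, LAM))
    return var + bias2

for k in (0.0, 0.05, 0.075, 0.10, 0.125, 0.15):
    print(k, [round(b, 5) for b in ridge_estimate(k)], round(total_mse(k), 5))
```

Working in the eigenbasis avoids inverting $A'A + kI$ explicitly, which is convenient here because the smallest eigenvalue, 0.02, makes the unregularized problem ill-conditioned.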
Example 5.14.
A. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular distribution $R(0, \theta)$, $0 < \theta < \infty$. A m.s.s. is the sample maximum $X_{(n)}$. The likelihood function is $L(\theta; X_{(n)}) = \theta^{-n}I\{\theta \geq X_{(n)}\}$. Accordingly, the MLE of $\theta$ is $\hat\theta = X_{(n)}$.

B. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular distribution $R(\theta, 3\theta)$, where $0 < \theta < \infty$. The likelihood function is

$$L(\theta; \mathbf{X}) = (2\theta)^{-n}I\{\theta \leq X_{(1)},\ X_{(n)} \leq 3\theta\} = (2\theta)^{-n}I\left\{\tfrac{1}{3}X_{(n)} \leq \theta \leq X_{(1)}\right\},$$

where $X_{(1)} = \min\{X_i\}$ and $X_{(n)} = \max\{X_i\}$. The m.s.s. is $(X_{(1)}, X_{(n)})$. We note that according to the present model $X_{(n)} \leq 3X_{(1)}$. If this inequality is not satisfied then the model is incompatible with the data. It is easy to check that the MLE of $\theta$ is $\hat\theta = \frac{1}{3}X_{(n)}$. The MLE is not a sufficient statistic.

C. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular distribution $R(\theta, \theta + 1)$, $-\infty < \theta < \infty$. The likelihood function in this case is

$$L(\theta; \mathbf{X}) = I\{\theta \leq X_{(1)} \leq X_{(n)} \leq \theta + 1\}.$$

Note that this likelihood function assumes a constant value 1 over the $\theta$ interval $[X_{(n)} - 1, X_{(1)}]$. Accordingly, any value of $\theta$ in this interval is an MLE. In the present case, the MLE is not unique.


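Example 5.15. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common Laplace (double-exponential) distribution with p.d.f.

$$f(x; \mu, \beta) = \frac{1}{2\beta}\exp\left\{-\frac{|x - \mu|}{\beta}\right\}, \quad -\infty < x < \infty,$$

$-\infty < \mu < \infty$, $0 < \beta < \infty$. A m.s.s. in the present case is the order statistic $X_{(1)} \leq \cdots \leq X_{(n)}$. The likelihood function of $(\mu, \beta)$, given $T = (X_{(1)}, \ldots, X_{(n)})$, is

$$L(\mu, \beta; T) = \frac{1}{(2\beta)^n}\exp\left\{-\frac{1}{\beta}\sum_{i=1}^n |X_{(i)} - \mu|\right\}.$$

The value of $\mu$ which minimizes $\sum_{i=1}^n |X_{(i)} - \mu|$ is the sample median $M_e$. Hence,

$$\sup_\mu L(\mu, \beta; T) = L(M_e, \beta; T) = \frac{1}{(2\beta)^n}\exp\left\{-\frac{1}{\beta}\sum_{i=1}^n |X_i - M_e|\right\}.$$

Finally, by differentiating $\log L(M_e, \beta; T)$ with respect to $\beta$, we find that the value of $\beta$ that maximizes $L(M_e, \beta; T)$ is

$$\hat\beta = \frac{1}{n}\sum_{i=1}^n |X_i - M_e|.$$

In the present case, the sample median $M_e$ and the sample mean absolute deviation from $M_e$ are the MLEs of $\mu$ and $\beta$, respectively. The MLE is not a sufficient statistic.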
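As a small illustration of these two MLEs, the sketch below computes the sample median and the mean absolute deviation from it. The data values are made up for the illustration and do not come from the text.

```python
import statistics

def laplace_mle(xs):
    """Return (mu_hat, beta_hat): the sample median and the mean absolute deviation from it."""
    mu_hat = statistics.median(xs)
    beta_hat = sum(abs(x - mu_hat) for x in xs) / len(xs)
    return mu_hat, beta_hat

sample = [2.1, -0.4, 1.3, 0.9, 5.0, 1.1, 0.2]   # hypothetical data
mu_hat, beta_hat = laplace_mle(sample)
print(mu_hat, beta_hat)
```

For an even sample size `statistics.median` returns the midpoint of the two central observations, which is one of the (non-unique) minimizers of $\sum |X_i - \mu|$.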


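Example 5.16. Consider the normal case in which $X_1, \ldots, X_n$ are i.i.d. random variables distributed like $N(\mu, \sigma^2)$; $-\infty < \mu < \infty$, $0 < \sigma^2 < \infty$. Both parameters are unknown. The m.s.s. is $(\bar X, Q)$, where $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ and $Q = \sum_{i=1}^n (X_i - \bar X)^2$. The likelihood function can be written as

$$L(\mu, \sigma^2; \bar X, Q) = \frac{1}{(\sigma^2)^{n/2}}\exp\left\{-\frac{Q}{2\sigma^2} - \frac{n}{2\sigma^2}(\bar X - \mu)^2\right\}.$$

Whatever the value of $\sigma^2$ is, the likelihood function is maximized by $\hat\mu = \bar X$. It is easy to verify that the value of $\sigma^2$ maximizing (5.5.9) is $\hat\sigma^2 = Q/n$.

The normal distributions under consideration can be written as a two-parameter exponential type, with p.d.f.s

$$f(\mathbf{x}; \psi_1, \psi_2) = \frac{1}{(2\pi)^{n/2}}\exp\{\psi_1 T_1 + \psi_2 T_2 - nK(\psi_1, \psi_2)\},$$

where

$$T_1 = \sum X_i, \quad T_2 = \sum X_i^2, \quad \psi_1 = \mu/\sigma^2, \quad \psi_2 = -1/(2\sigma^2),$$

and $K(\psi_1, \psi_2) = -\frac{\psi_1^2}{4\psi_2} + \frac{1}{2}\log(-1/(2\psi_2))$. Differentiating the log-likelihood partially with respect to $\psi_1$ and $\psi_2$, we obtain that the MLEs of these (natural) parameters should satisfy the system of equations

$$-\frac{\hat\psi_1}{2\hat\psi_2} = \frac{T_1}{n},$$

$$\frac{\hat\psi_1^2}{4\hat\psi_2^2} - \frac{1}{2\hat\psi_2} = \frac{T_2}{n}.$$

We note that $T_1/n = \hat\mu$ and $T_2/n = \hat\sigma^2 + \hat\mu^2$, where $\hat\mu = \bar X$ and $\hat\sigma^2 = Q/n$ are the MLEs of $\mu$ and $\sigma^2$, respectively. Substituting $\hat\mu$ and $\hat\sigma^2 + \hat\mu^2$, we obtain $\hat\psi_1 = \hat\mu/\hat\sigma^2$, $\hat\psi_2 = -1/(2\hat\sigma^2)$. In other words, the relationship between the MLEs $\hat\psi_1$ and $\hat\psi_2$ and the MLEs $\hat\mu$ and $\hat\sigma^2$ is exactly like that of $\psi_1$ and $\psi_2$ to $\mu$ and $\sigma^2$.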


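Example 5.17. Consider again the model of Example 5.10. Differentiating the log-likelihood

$$l(\theta, \gamma, \sigma^2 \mid \bar X, \bar Y, Q) = -n\log(\sigma^2) - \frac{n}{2\sigma^2}\left[(\bar X - \theta)^2 + (\bar Y - \gamma\theta^2)^2 + \frac{Q}{n}\right],$$

with respect to the parameters, we obtain the equations

$$\frac{n}{\sigma^2}(\bar X - \theta) + \frac{2n}{\sigma^2}\gamma\theta(\bar Y - \gamma\theta^2) = 0,$$

$$\frac{n}{\sigma^2}(\bar Y - \gamma\theta^2)\theta^2 = 0,$$

and

$$-\frac{n}{\sigma^2} + \frac{n}{2\sigma^4}\left[(\bar X - \theta)^2 + (\bar Y - \gamma\theta^2)^2 + \frac{Q}{n}\right] = 0.$$

The unique solution of these equations is

$$\hat\theta = \bar X, \qquad \hat\gamma = \bar Y/\bar X^2,$$

and

$$\hat\sigma^2 = \frac{Q}{2n}.$$

It is interesting to realize that $E\{\hat\gamma\}$ does not exist, and obviously $\hat\gamma$ does not have a finite variance. By the delta method one can find the asymptotic mean and variance of $\hat\gamma$.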


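Example 5.18. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a log-normal distribution $LN(\mu, \sigma^2)$. The expected value of $X$ and its variance are

$$\xi = \exp\{\mu + \sigma^2/2\}$$

and

$$D^2 = \xi^2(e^{\sigma^2} - 1).$$

We have previously shown that the MLEs of $\mu$ and $\sigma^2$ are $\hat\mu = \frac{1}{n}\sum_{i=1}^n Y_i = \bar Y$ and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y)^2$, where $Y_i = \log X_i$, $i = 1, \ldots, n$. Thus, the MLEs of $\xi$ and $D^2$ are

$$\hat\xi = \exp\{\hat\mu + \hat\sigma^2/2\}$$

and

$$\hat D^2 = \hat\xi^2(e^{\hat\sigma^2} - 1).$$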

Example 5.19. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a normal distribution $N(\mu, \sigma^2)$, $-\infty < \mu < \infty$, $0 < \sigma^2 < \infty$. The MLEs of $\mu$ and $\sigma^2$ are $\hat\mu = \bar X_n$ and $\hat\sigma^2 = Q/n$, where $Q = \sum_{i=1}^n (X_i - \bar X_n)^2$. By the invariance principle, the MLE of $\theta = \Phi\left(\frac{\mu}{\sigma}\right)$ is $\hat\theta = \Phi\left(\frac{\bar X_n}{\hat\sigma}\right)$.

Example 5.20. Consider the Weibull distributions, $G^{1/\beta}(\lambda, 1)$, where $0 < \lambda, \beta < \infty$
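are unknown. The likelihood function of $(\lambda, \beta)$ is

$$L(\lambda, \beta; \mathbf{X}) = (\lambda\beta)^n\left(\prod_{i=1}^n X_i\right)^{\beta}\exp\left\{-\lambda\sum_{i=1}^n X_i^\beta\right\}.$$

Note that the likelihood is equal to the joint p.d.f. of $\mathbf{X}$ multiplied by $\prod_{i=1}^n X_i$, which is positive with probability one. To obtain the MLEs of $\lambda$ and $\beta$, we differentiate the log-likelihood partially with respect to these variables and set the derivatives equal to zero. We obtain the system of equations:

$$\hat\lambda = n\Big/\sum_{i=1}^n X_i^{\hat\beta},$$

$$\hat\beta = \left[\frac{\sum_{i=1}^n X_i^{\hat\beta}\log X_i}{\sum_{i=1}^n X_i^{\hat\beta}} - \frac{1}{n}\sum_{i=1}^n \log X_i\right]^{-1}.$$

We show now that $\hat\beta$ is always positive and that a unique solution exists. Let $\mathbf{x} = (x_1, \ldots, x_n)$, where $0 < x_i < \infty$, $i = 1, \ldots, n$, and let $F(\beta; \mathbf{x}) = \sum_{i=1}^n x_i^\beta\log x_i\Big/\sum_{i=1}^n x_i^\beta$. Note that, for every $\mathbf{x}$,

$$\frac{\partial}{\partial\beta}F(\beta; \mathbf{x}) = \frac{\sum_{i=1}^n x_i^\beta(\log x_i)^2\cdot\sum_{i=1}^n x_i^\beta - \left(\sum_{i=1}^n x_i^\beta\log x_i\right)^2}{\left(\sum_{i=1}^n x_i^\beta\right)^2} \geq 0,$$

with a strict inequality if the $x_i$ values are not all the same. Indeed, if $\omega_i = x_i^\beta$ and $\bar\eta = \sum_{i=1}^n \omega_i\log x_i\Big/\sum_{i=1}^n \omega_i$ then $\frac{\partial}{\partial\beta}F(\beta; \mathbf{x}) = \sum_{i=1}^n \omega_i(\log x_i - \bar\eta)^2\Big/\sum_{i=1}^n \omega_i$. Hence, $F(\beta; \mathbf{x})$ is strictly increasing in $\beta$, with probability one. Furthermore, $\lim_{\beta \to 0} F(\beta; \mathbf{x}) = \frac{1}{n}\sum_{i=1}^n \log x_i$ and $\lim_{\beta \to \infty} F(\beta; \mathbf{x}) = \log x_{(n)}$. Thus, the RHS of the $\beta$-equation is a positive, decreasing function of $\beta$, approaching $\infty$ as $\beta \to 0$ and approaching $\left(\log x_{(n)} - \frac{1}{n}\sum_{i=1}^n \log x_i\right)^{-1}$ as $\beta \to \infty$. This proves that the solution $\hat\beta$ is unique.

The solution for $\hat\beta$ can be obtained iteratively from the recursive equation

$$\hat\beta_{j+1} = \left[\frac{\sum_{i=1}^n x_i^{\hat\beta_j}\log x_i}{\sum_{i=1}^n x_i^{\hat\beta_j}} - \frac{1}{n}\sum_{i=1}^n \log x_i\right]^{-1}, \quad j = 0, 1, \ldots,$$

starting with $\hat\beta_0 = 1$.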
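The recursion for $\hat\beta$ is straightforward to program. The sketch below is illustrative only: the stopping tolerance, the iteration cap, and the test data (Weibull quantiles with $\lambda = 1$, $\beta = 2$) are assumptions made here, not values from the text. Because the right-hand side is decreasing in $\beta$, successive iterates oscillate around the solution; convergence is assumed rather than proved, so the loop is capped.

```python
import math

def weibull_mle(xs, beta0=1.0, tol=1e-10, max_iter=1000):
    """Iterate beta_{j+1} = [sum x^b log x / sum x^b - mean(log x)]^{-1}."""
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(xs)
    beta = beta0
    for _ in range(max_iter):
        w = [x ** beta for x in xs]
        f = sum(wi * li for wi, li in zip(w, logs)) / sum(w)
        new_beta = 1.0 / (f - mean_log)
        converged = abs(new_beta - beta) < tol
        beta = new_beta
        if converged:
            break
    lam = len(xs) / sum(x ** beta for x in xs)   # lambda-equation
    return lam, beta

# Hypothetical data: quantiles of the Weibull distribution with lambda = 1, beta = 2.
n = 50
data = [(-math.log(1 - (i - 0.5) / n)) ** 0.5 for i in range(1, n + 1)]
lam_hat, beta_hat = weibull_mle(data)
print(lam_hat, beta_hat)
```

If the plain recursion is slow or unstable on a given data set, averaging the old and new iterates at each step is a common damping device; bisection on the $\beta$-equation is another option, since its right-hand side is monotone.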


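Example 5.21. The present example was given by Stein (1962) in order to illustrate a possible anomalous property of the MLE.

Let $\mathcal{F}$ be a scale-parameter family of distributions, with p.d.f.

$$f(x; \theta) = \frac{1}{\theta}\phi\left(\frac{x}{\theta}\right), \quad 0 < \theta < \infty,$$

where

$$\phi(x) = \begin{cases} \dfrac{B}{x}\exp\left\{-50\left(1 - \dfrac{1}{x}\right)^2\right\}, & \text{if } 0 \leq x \leq b, \\ 0, & \text{otherwise,} \end{cases}$$

where

$$0 < b < \infty \quad \text{and} \quad B^{-1} = \int_0^b \frac{1}{x}\exp\left\{-50\left(1 - \frac{1}{x}\right)^2\right\}dx.$$

Note that $\int_0^\infty \frac{1}{x}\exp\left\{-50\left(1 - \frac{1}{x}\right)^2\right\}dx = \infty$. Accordingly, we choose $b$ sufficiently large so that $\int_{10}^b \phi(x)\,dx = 0.99$. The likelihood function of $\theta$ corresponding to one observation is thus

$$L(\theta; x) = \begin{cases} \exp\{-50(\theta - x)^2/x^2\}, & \text{if } \dfrac{x}{b} \leq \theta < \infty, \\ 0, & \text{if } 0 < \theta < \dfrac{x}{b}. \end{cases}$$

The MLE of $\theta$ is $\hat\theta = X$. However, according to the construction of $\phi(x)$,

$$P_\theta\{\hat\theta \geq 10\theta\} = \int_{10\theta}^{b\theta} f(x; \theta)\,dx = \int_{10}^b \phi(x)\,dx = 0.99, \quad \text{for all } \theta.$$

The MLE here is a bad estimator for all $\theta$.

Example 5.22. Another source of anomaly of the MLE is the effect of nuisance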
parameters. A very well-known example of the bad effect of nuisance parameters is due to Neyman and Scott (1948). Their example is presented here.

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be $n$ independent random vectors having the distributions $N(\mu_i\mathbf{1}_2, \sigma^2 I_2)$, $i = 1, \ldots, n$. In other words, each pair $(X_i, Y_i)$ can be considered as representing two independent random variables having a normal distribution with mean $\mu_i$ and variance $\sigma^2$. The variance is common to all the vectors. We note that $D_i = X_i - Y_i \sim N(0, 2\sigma^2)$ for all $i = 1, \ldots, n$. Hence, $\hat\sigma_n^2 = \frac{1}{2n}\sum_{i=1}^n D_i^2$ is an unbiased estimator of $\sigma^2$. The variance of $\hat\sigma_n^2$ is $\mathrm{Var}\{\hat\sigma_n^2\} = 2\sigma^4/n$. Thus, $\hat\sigma_n^2$ approaches the value of $\sigma^2$ with probability 1 for all $(\mu_i, \sigma)$. We turn now to the MLE of $\sigma^2$. The parameter space is $\Theta = \{(\mu_1, \ldots, \mu_n, \sigma^2) : -\infty < \mu_i < \infty,\ i = 1, \ldots, n,\ 0 < \sigma^2 < \infty\}$. We have to determine a point $(\mu_1, \ldots, \mu_n, \sigma^2)$ that maximizes the likelihood function

$$L(\mu_1, \ldots, \mu_n, \sigma^2; \mathbf{x}, \mathbf{y}) = \frac{1}{\sigma^{2n}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n \left[(x_i - \mu_i)^2 + (y_i - \mu_i)^2\right]\right\}.$$

We note that $(x_i - \mu_i)^2 + (y_i - \mu_i)^2$ is minimized by $\hat\mu_i = (x_i + y_i)/2$. Thus,

$$L(\hat\mu_1, \ldots, \hat\mu_n, \sigma^2; \mathbf{x}, \mathbf{y}) = \frac{1}{\sigma^{2n}}\exp\left\{-\frac{1}{4\sigma^2}\sum_{i=1}^n D_i^2\right\}.$$

The value of $\sigma^2$ that maximizes the likelihood is $\tilde\sigma^2 = \frac{1}{4n}\sum_{i=1}^n D_i^2$. Note that $E_\theta\{\tilde\sigma^2\} = \sigma^2/2$ and that, by the strong law of large numbers, $\tilde\sigma^2 \to \sigma^2/2$ with probability one for each $\sigma^2$.

Thus, the more information we have on $\sigma^2$ (the larger the sample is) the worse the MLE becomes. It is interesting that if we do not use all the information available then the MLE may become a reasonable estimator. Note that at each given value of $\sigma^2$, $M_i = (X_i + Y_i)/2$ is a sufficient statistic for $\mu_i$. Accordingly, the conditional distribution of $(\mathbf{X}, \mathbf{Y})$ given $\mathbf{M} = (M_1, \ldots, M_n)'$ is independent of $\mu$. If we consider the semi-likelihood function, which is proportional to the conditional p.d.f. of $(\mathbf{X}, \mathbf{Y})$, given $\mathbf{M}$ and $\sigma^2$, then the value of $\sigma^2$ that maximizes this semi-likelihood function coincides with the unbiased estimator $\hat\sigma_n^2$.
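The inconsistency of the MLE in the Neyman and Scott example is easy to see in a simulation. The sketch below is an illustration under assumptions made here: the nuisance means are drawn uniformly on $(-10, 10)$, $\sigma = 1$, and the seed is fixed; none of these choices come from the text.

```python
import random

def neyman_scott(n, sigma=1.0, seed=0):
    """Simulate n pairs with pair-specific means; return (unbiased estimate, MLE) of sigma^2."""
    rng = random.Random(seed)
    ss = 0.0
    for _ in range(n):
        mu_i = rng.uniform(-10.0, 10.0)     # arbitrary nuisance mean for pair i
        x = rng.gauss(mu_i, sigma)
        y = rng.gauss(mu_i, sigma)
        ss += (x - y) ** 2                  # D_i^2
    return ss / (2 * n), ss / (4 * n)

unbiased, mle = neyman_scott(20000)
print(unbiased, mle)
```

With many pairs the first value settles near $\sigma^2$ and the second near $\sigma^2/2$, regardless of how the $\mu_i$ are chosen.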


Example 5.23. Consider the standard logistic tolerance distribution, i.e.,

$$F(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}, \quad -\infty < z < \infty.$$

The corresponding p.d.f. is

$$f(z) = \frac{e^z}{(1 + e^z)^2}.$$

The corresponding function $G(z; \hat p)$ given by (5.6.10) is

$$G(z; \hat p) = \hat p - F(z), \quad -\infty < z < \infty.$$

The logit, $F^{-1}(z)$, is given by

$$z = \log\left(\frac{F(z)}{1 - F(z)}\right).$$

Let $\hat p_i$ be the observed proportion of response at dosage $x_i$. Define $\zeta_i = \log\left(\frac{\hat p_i}{1 - \hat p_i}\right)$, if $0 < \hat p_i < 1$. According to the model

$$\log\left(\frac{F(\alpha + \beta x_i)}{1 - F(\alpha + \beta x_i)}\right) = \alpha + \beta x_i, \quad i = 1, \ldots, k.$$

We, therefore, fit by least squares the line

$$\hat z_i = \log\left(\frac{\hat p_i}{1 - \hat p_i}\right) = \alpha + \beta x_i, \quad i = 1, \ldots, k$$

to obtain the initial estimates of $\alpha$ and $\beta$. After that we use the iterative procedure (5.6.12) to correct the initial estimates. For example, suppose that the dosages (log dilution) are $x_1 = -2.5$, $x_2 = -2.25$, $x_3 = -2$, $x_4 = -1.75$, and $x_5 = -1.5$. At each dosage a sample of size $n = 20$ is observed, and the results are

  $x_i$           -2.5       -2.25      -2         -1.75      -1.5
  $\hat p_i$       0.05       0.10       0.15       0.45       0.50
  $\hat z_i$      -2.9444    -2.1972    -1.7346    -0.2007     0

Least-squares fitting of the regression line $\hat z_i = \hat\alpha + \hat\beta x_i$ yields the initial estimates $\hat\alpha = 4.893$ and $\hat\beta = 3.154$. Since $G'(z; \hat p) = -f(z)$, we define the weights

$$W_i^{(j)} = n_i\frac{\exp(\hat\alpha^{(j)} + \hat\beta^{(j)}x_i)}{(1 + \exp(\hat\alpha^{(j)} + \hat\beta^{(j)}x_i))^2}$$

and

$$Y_i^{(j)} = n_i\left(\hat p_i - \frac{\exp(\hat\alpha^{(j)} + \hat\beta^{(j)}x_i)}{1 + \exp(\hat\alpha^{(j)} + \hat\beta^{(j)}x_i)}\right).$$

We solve then equations (5.6.12) to obtain the corrections to the initial estimates. The first five iterations gave the following results:

  $j$     $\hat\alpha^{(j)}$     $\hat\beta^{(j)}$
  0       4.89286                3.15412
  1       4.93512                3.16438
  2       4.93547                3.16404
  3       4.93547                3.16404
  4       4.93547                3.16404
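The correction step can be sketched as follows. Equations (5.6.12) are not reproduced in this section, so the code assumes a plain Newton-Raphson step (which coincides with Fisher scoring for the logistic model), built from the weights $W_i$ and the quantities $Y_i$ defined in the example; the intermediate iterates may therefore differ slightly from those of the book's procedure, although the limit is the same MLE.

```python
import math

def logistic_mle(xs, ps, n, alpha, beta, iters=25):
    """Newton-Raphson for the model logit P(response at x) = alpha + beta * x."""
    for _ in range(iters):
        F = [1.0 / (1.0 + math.exp(-(alpha + beta * x))) for x in xs]
        W = [n * f * (1.0 - f) for f in F]              # weights n_i f(z_i)
        Y = [n * (p - f) for p, f in zip(ps, F)]        # score contributions
        s0 = sum(Y)
        s1 = sum(x * y for x, y in zip(xs, Y))
        a = sum(W)
        b = sum(w * x for w, x in zip(W, xs))
        c = sum(w * x * x for w, x in zip(W, xs))
        det = a * c - b * b
        # Solve the 2x2 system [[a, b], [b, c]] * delta = (s0, s1).
        alpha += (c * s0 - b * s1) / det
        beta += (a * s1 - b * s0) / det
    return alpha, beta

xs = [-2.5, -2.25, -2.0, -1.75, -1.5]
ps = [0.05, 0.10, 0.15, 0.45, 0.50]
alpha_hat, beta_hat = logistic_mle(xs, ps, 20, 4.89286, 3.15412)
print(alpha_hat, beta_hat)
```

Starting from the least-squares values is what makes so few iterations sufficient; a poor starting point could require step-halving, which this sketch does not implement.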
Example 5.24. $X_1, \ldots, X_n$ are i.i.d. random variables distributed like $N(\mu, \sigma^2)$, where $-\infty < \mu < \infty$, $0 < \sigma < \infty$. The group $\mathcal{G}$ considered is that of the real-affine transformations. A m.s.s. is $(\bar X, Q)$, where $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ and $Q = \sum_{i=1}^n (X_i - \bar X)^2$. If $[\alpha, \beta]$ is an element of $\mathcal{G}$ then

$$[\alpha, \beta]x_i = \alpha + \beta x_i, \quad i = 1, \ldots, n,$$

$$[\alpha, \beta](\mu, \sigma) = (\alpha + \beta\mu, \beta\sigma),$$

and

$$[\alpha, \beta](\bar X, Q) = (\alpha + \beta\bar X, \beta^2 Q).$$

If $\hat\mu(\bar X, Q)$ is an equivariant estimator of $\mu$ then

$$\hat\mu(\alpha + \beta\bar X, \beta^2 Q) = \alpha + \beta\hat\mu(\bar X, Q) = [\alpha, \beta]\hat\mu(\bar X, Q)$$

for all $[\alpha, \beta] \in \mathcal{G}$. Similarly, every equivariant estimator of $\sigma^2$ should satisfy the relationship

$$[\alpha, \beta]\hat\sigma^2(\bar X, Q) = \beta^2\hat\sigma^2(\bar X, Q),$$

for all $[\alpha, \beta] \in \mathcal{G}$. The m.s.s. $(\bar X, Q)$ is reduced by the transformation $[-\bar X, 1]$ to $(0, Q)$. This transformation is a maximal invariant reduction of $(\bar X, Q)$ with respect to the subgroup of translations $\mathcal{G}_1 = \{[\alpha, 1], -\infty < \alpha < \infty\}$. The difference $D(\bar X, Q) = \hat\mu(\bar X, Q) - \bar X$ is translation invariant, i.e., $[\alpha, 1]D(\bar X, Q) = D(\bar X, Q)$ for all $[\alpha, 1] \in \mathcal{G}_1$. Hence, $D(\bar X, Q)$ is a function of the maximal invariant with respect to $\mathcal{G}_1$. Accordingly, every equivariant estimator can be expressed as

$$\hat\mu(\bar X, Q) = \bar X + f(Q),$$

where $f(Q)$ is a statistic depending only on $Q$. Similarly, we can show that every equivariant estimator of $\sigma^2$ should be of the form

$$\hat\sigma^2(\bar X, Q) = \lambda Q,$$

where $\lambda$ is a positive constant. We can also determine the equivariant estimators of $\mu$ and $\sigma^2$ having the minimal MSE. We apply the result that $\bar X$ and $Q$ are independent. The MSE of $\bar X + f(Q)$ for any statistic $f(Q)$ is

$$E\{[\bar X + f(Q) - \mu]^2\} = E\{E\{[\bar X - \mu + f(Q)]^2 \mid Q\}\} = \frac{\sigma^2}{n} + E\{f^2(Q)\}.$$

Hence, the MSE is minimized by choosing $f(Q) = 0$. Accordingly, the sample mean $\bar X$ is the minimal MSE equivariant estimator of $\mu$. Similarly, one can verify that the equivariant estimator of $\sigma^2$ which has the minimal MSE is $\hat\sigma^2 = Q/(n + 1)$. Note that this estimator is biased. The UMVU estimator is $Q/(n - 1)$ and the MLE is $Q/n$.
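The comparison of the three estimators of $\sigma^2$ can be made exact. Since $Q/\sigma^2 \sim \chi^2[n-1]$, we have $E\{Q\} = (n-1)\sigma^2$ and $E\{Q^2\} = (n-1)(n+1)\sigma^4$, so the MSE of $\lambda Q$ in units of $\sigma^4$ is $\lambda^2(n-1)(n+1) - 2\lambda(n-1) + 1$, which is minimized at $\lambda = 1/(n+1)$. The sketch below evaluates it with exact rational arithmetic; the function name and the choice $n = 10$ are illustrative.

```python
from fractions import Fraction

def scaled_mse(c, n):
    """MSE of c*Q as an estimator of sigma^2, in units of sigma^4.

    Uses E{Q} = (n-1) sigma^2 and E{Q^2} = (n-1)(n+1) sigma^4.
    """
    return c * c * (n - 1) * (n + 1) - 2 * c * (n - 1) + 1

n = 10
mse = {name: scaled_mse(Fraction(1, d), n)
       for name, d in (("equivariant", n + 1), ("MLE", n), ("UMVU", n - 1))}
print(mse)
```

The ordering equivariant < MLE < UMVU in MSE holds for every $n \geq 2$, since the three values are $2/(n+1)$, $(2n-1)/n^2$, and $2/(n-1)$.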
Example 5.25. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common $N(\mu, \sigma_1^2)$ distribution. Let $Y_1, \ldots, Y_n$ be i.i.d. random variables distributed as $N(\mu, \sigma_2^2)$. The $X$ and the $Y$ vectors are independent. The two distributions have a common mean $\mu$, $-\infty < \mu < \infty$, and possibly different variances. The variance ratio $\rho = \sigma_2^2/\sigma_1^2$ is unknown. A m.s.s. is $(\bar X, Q(\mathbf{X}), \bar Y, Q(\mathbf{Y}))$, where $\bar X$ and $\bar Y$ are the sample means and $Q(\mathbf{X})$ and $Q(\mathbf{Y})$ are the sample sums of squares of deviations around the means. $\bar X$, $Q(\mathbf{X})$, $\bar Y$, and $Q(\mathbf{Y})$ are mutually independent. Consider the group $\mathcal{G}$ of affine transformations $\mathcal{G} = \{[\alpha, \beta] : -\infty < \alpha < \infty, -\infty < \beta < \infty\}$. A maximal invariant statistic is

$$V = \left(\frac{Q(\mathbf{X})}{(\bar Y - \bar X)^2}, \frac{Q(\mathbf{Y})}{(\bar Y - \bar X)^2}\right).$$

Let $W = (\bar X, \bar Y - \bar X)$. The vector $(W, V)$ is also a m.s.s. Note that

$$[\alpha, \beta](W, V) = (\alpha + \beta\bar X, \beta(\bar Y - \bar X), V),$$

for all $[\alpha, \beta] \in \mathcal{G}$. Hence, if $\hat\mu(W, V)$ is an equivariant estimator of the common mean it should be of the form

$$\hat\mu(W, V) = \bar X + (\bar Y - \bar X)\psi(V),$$

where $\psi(V)$ is a function of the maximal invariant statistic $V$. Indeed, $\bar Y \neq \bar X$ with probability one, and $(\hat\mu(W, V) - \bar X)/(\bar Y - \bar X)$ is an invariant statistic, with respect to $\mathcal{G}$. We derive now the MSE of $\hat\mu(W, V)$. We prove first that every such equivariant estimator is unbiased. Indeed, for every $\theta = (\mu, \sigma_1^2, \rho)$


$$E_\theta\{\hat\mu(W, V)\} = E_\theta\{\bar X + (\bar Y - \bar X)\psi(V)\} = \mu + E_\theta\{(\bar Y - \bar X)\psi(V)\}.$$

Moreover, by Basu's Theorem (3.6.1), $V$ is independent of $(\bar X, \bar Y)$. Hence,

$$E_\theta\{(\bar Y - \bar X)\psi(V) \mid |\bar Y - \bar X|\} = E\{\psi(V)\}E_\theta\{(\bar Y - \bar X) \mid |\bar Y - \bar X|\} = 0,$$

with probability one, since the distribution of $\bar Y - \bar X$ is symmetric around zero. This implies the unbiasedness of $\hat\mu(W, V)$. The variance of this estimator is

$$V_\theta\{\bar X + (\bar Y - \bar X)\psi(V)\} = \frac{\sigma_1^2}{n} + 2\,\mathrm{cov}_\theta(\bar X, (\bar Y - \bar X)\psi(V)) + V_\theta\{(\bar Y - \bar X)\psi(V)\}.$$

Since $E_\theta\{(\bar Y - \bar X)\psi(V)\} = 0$, we obtain that

$$\mathrm{cov}_\theta(\bar X, (\bar Y - \bar X)\psi(V)) = E_\theta\{(\bar X - \mu)(\bar Y - \bar X)\psi(V)\}.$$

The distribution of $\bar X - \mu$ depends only on $\sigma_1^2$. The maximal invariant statistic $V$ is independent of $\mu$ and $\sigma_1^2$. It follows from Basu's Theorem that $(\bar X - \mu)$ and $\psi(V)$ are independent. Moreover, the conditional distribution of $\bar X - \mu$ given $\bar Y - \bar X$ is the normal distribution $N\left(-\frac{1}{1+\rho}(\bar Y - \bar X), \frac{\sigma_1^2}{n}\cdot\frac{\rho}{1+\rho}\right)$. Thus,

$$\mathrm{cov}_\theta(\bar X, (\bar Y - \bar X)\psi(V)) = E_\theta\{\psi(V)(\bar Y - \bar X)E_\theta\{(\bar X - \mu) \mid \bar Y - \bar X\}\} = -\frac{1}{1+\rho}E_\theta\{\psi(V)(\bar Y - \bar X)^2\}.$$

The conditional distribution of $(\bar Y - \bar X)^2$ given $V$ is the gamma distribution $G(\lambda, \nu)$ with

$$\lambda = \frac{1}{2\sigma_1^2}\left(\frac{n}{1+\rho} + Z_1 + \frac{Z_2}{\rho}\right) \quad \text{and} \quad \nu = n - \frac{1}{2},$$

where $Z_1 = Q(\mathbf{X})/(\bar Y - \bar X)^2$ and $Z_2 = Q(\mathbf{Y})/(\bar Y - \bar X)^2$. We thus obtain the expression

$$V_\theta\{\hat\mu(W, V)\} = \frac{\sigma_1^2}{n}\left(1 + (2n - 1)E_\rho\left\{\frac{\psi^2(Z_1, Z_2) - \frac{2}{1+\rho}\psi(Z_1, Z_2)}{\frac{1}{1+\rho} + \frac{Z_1}{n} + \frac{Z_2}{n\rho}}\right\}\right).$$

We see that in the present example the variance divided by $\sigma_1^2/n$ depends not only on the particular function $\psi(Z_1, Z_2)$ but also on the (nuisance) parameter $\rho = \sigma_2^2/\sigma_1^2$.

This is due to the fact that $\rho$ is invariant with respect to $\mathcal{G}$. Thus, if $\rho$ is unknown there is no equivariant estimator having minimum variance for all $\theta$ values. There are several papers in which this problem is studied (Brown and Cohen, 1974; Cohen and Sackrowitz, 1974; Kubokawa, 1987; Zacks, 1970a).

Example 5.26. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common Weibull
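distribution $G^{1/\beta}(\lambda^{-\beta}, 1)$, where $0 < \lambda, \beta < \infty$. Note that the parametrization here is different from that of Example 5.20. The present parametrization yields the desired structural properties. The m.s.s. is the order statistic, $T(\mathbf{X}) = (X_{(1)}, \ldots, X_{(n)})$, where $X_{(1)} \leq \cdots \leq X_{(n)}$. Let $\hat\lambda(T)$ and $\hat\beta(T)$ be the MLEs of $\lambda$ and $\beta$, respectively. We obtain the values of these estimators as in Example 5.20, with proper modification of the likelihood function. Define the group of transformations

$$\mathcal{G} = \{[a, b];\ 0 < a, b < \infty\},$$

where

$$[a, b]X_i = aX_i^{1/b}, \quad i = 1, \ldots, n.$$

Note that the distribution of $[a, b]X$ is as that of $a\lambda^{1/b}G^{1/\beta b}(1, 1)$ or as that of $G^{1/\beta b}((a\lambda^{1/b})^{-\beta b}, 1)$. Accordingly, if $X \to [a, b]X$ then the parametric point $(\lambda, \beta)$ is transformed to

$$[a, b](\lambda, \beta) = (a\lambda^{1/b}, b\beta).$$

It is easy to verify that

$$[a, b][c, d] = [ac^{1/b}, bd]$$

and

$$[a, b]^{-1} = \left[\frac{1}{a^b}, \frac{1}{b}\right].$$

The reduction of the m.s.s. $T$ by the transformation $[\hat\lambda, \hat\beta]^{-1}$ yields the maximal invariant $U(T)$

$$U(X_{(1)}, \ldots, X_{(n)}) = \left(\left(\frac{\lambda}{\hat\lambda}\right)^{\hat\beta}G_{(1)}^{\hat\beta/\beta}, \ldots, \left(\frac{\lambda}{\hat\lambda}\right)^{\hat\beta}G_{(n)}^{\hat\beta/\beta}\right),$$

where $G_{(1)} \leq \cdots \leq G_{(n)}$ is the order statistic of $n$ i.i.d. $E(1)$ random variables. The distribution of $U(T)$ does not depend on $(\lambda, \beta)$. Thus, $(\hat\lambda/\lambda)^{\hat\beta}$ is distributed independently of $(\lambda, \beta)$ and so is $\hat\beta/\beta$.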


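Let $\tilde\lambda = F(\hat\lambda, \hat\beta, U(T))$ and $\tilde\beta = G(\hat\lambda, \hat\beta, U(T))$ be equivariant estimators of $\lambda$ and $\beta$ respectively. According to the definition of equivariance

$$[\hat\lambda, \hat\beta]^{-1}F(\hat\lambda, \hat\beta, U(T)) = \left(\frac{F(\hat\lambda, \hat\beta, U(T))}{\hat\lambda}\right)^{\hat\beta} = F([\hat\lambda, \hat\beta]^{-1}\hat\lambda, [\hat\lambda, \hat\beta]^{-1}\hat\beta, U(T)) = F(1, 1, U(T)) = \psi(U(T)).$$

Accordingly, every equivariant estimator of $\lambda$ is of the form

$$\tilde\lambda = \hat\lambda(\psi(U(T)))^{1/\hat\beta}.$$

Similarly, every equivariant estimator of $\beta$ is of the form

$$\tilde\beta = \hat\beta H(U(T)).$$

Note the relationship between the class of all equivariant estimators $(\tilde\lambda, \tilde\beta)$ and the MLEs $(\hat\lambda, \hat\beta)$. In particular, if we choose $\psi(U(T)) = 1$ w.p.1 and $H(U(T)) = 1$ w.p.1 we obtain that the MLEs $\hat\lambda$ and $\hat\beta$ are equivariant. This property also follows from the fact that the MLE of $[a, b](\lambda, \beta)$ is $[a, b](\hat\lambda, \hat\beta)$ for all $[a, b]$ in $\mathcal{G}$. We will consider now the problem of choosing the functions $H(U(T))$ and $\psi(U(T))$ to minimize the risk of the equivariant estimator. For this purpose we consider a quadratic loss function in the logarithms, i.e.,

$$L((\tilde\lambda, \tilde\beta), (\lambda, \beta)) = \left(\log\left(\frac{\tilde\lambda}{\lambda}\right)^{\tilde\beta}\right)^2 + \left(\log\frac{\tilde\beta}{\beta}\right)^2.$$

It is easy to check that this loss function is invariant with respect to $\mathcal{G}$. Furthermore, the risk function does not depend on $(\lambda, \beta)$. We can, therefore, choose $\psi$ and $H$ to minimize the risk. The conditional risk function, given $U(T)$, when $\psi(U(T)) = \psi$ and $H(U(T)) = H$, is

$$R(\psi, H) = E\left\{\left[\log\left(\frac{\tilde\lambda}{\lambda}\right)^{\tilde\beta}\right]^2 \,\Big|\, U\right\} + E\left\{\left[\log\frac{\tilde\beta}{\beta}\right]^2 \,\Big|\, U\right\}$$

$$= H^2E\left\{\left[\log\left(\frac{\hat\lambda}{\lambda}\right)^{\hat\beta} + \log\psi\right]^2 \,\Big|\, U\right\} + E\left\{\left[\log\frac{\hat\beta}{\beta} + \log H\right]^2 \,\Big|\, U\right\}.$$

Since $(\hat\lambda/\lambda)^{\hat\beta}$ and $\hat\beta/\beta$ are ancillary statistics, and since $T$ is a complete sufficient statistic, we infer from Basu's Theorem that $(\hat\lambda/\lambda)^{\hat\beta}$ and $\hat\beta/\beta$ are independent of $U(T)$. Hence, the conditional expectations are equal to the total expectations. Partial differentiation with respect to $H$ and $\psi$ yields the system of equations:

$$\text{(I)} \quad H^0E\left\{\left[\log\left(\frac{\hat\lambda}{\lambda}\right)^{\hat\beta} + \log\psi^0\right]^2\right\} + \frac{1}{H^0}E\left\{\log\frac{\hat\beta}{\beta} + \log H^0\right\} = 0,$$

$$\text{(II)} \quad E\left\{\log\left(\frac{\hat\lambda}{\lambda}\right)^{\hat\beta} + \log\psi^0\right\} = 0.$$

From equation (II), we immediately obtain the expression

$$\psi^0 = \exp\left\{-E\left\{\log\left(\frac{\hat\lambda}{\lambda}\right)^{\hat\beta}\right\}\right\}.$$

Substituting this $\psi^0$ in (I), we obtain the equation

$$(H^0)^2V\left\{\log\left(\frac{\hat\lambda}{\lambda}\right)^{\hat\beta}\right\} + \log H^0 + E\left\{\log\frac{\hat\beta}{\beta}\right\} = 0.$$

This equation can be solved numerically to obtain the optimal constant $H^0$. Thus, by choosing the functions $\psi(U)$ and $H(U)$ equal (with probability one) to the constants $\psi^0$ and $H^0$, respectively, we obtain the minimum MSE equivariant estimators. We can estimate $\psi^0$ and $H^0$ by simulation, using the special values of $\lambda = 1$ and $\beta = 1$.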


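Example 5.27. As in Example 5.15, let $X_1, \ldots, X_n$ be i.i.d. random variables having a Laplace distribution with a location parameter $\mu$ and scale parameter $\beta$, where $-\infty < \mu < \infty$ and $0 < \beta < \infty$. The first two moments of this distribution are

$$\mu_1 = \mu, \qquad \mu_2 = 2\beta^2 + \mu^2.$$

The sample moments are $M_1 = \bar X$ and $M_2 = \frac{1}{n}\sum_{i=1}^n X_i^2$. Accordingly, the MEEs of $\mu$ and $\beta$ are

$$\hat\mu = \bar X, \qquad \hat\beta = \hat\sigma/\sqrt{2},$$

where $\hat\sigma^2 = M_2 - M_1^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$. It is interesting to compare these MEEs to the MLEs of $\mu$ and $\beta$ that were derived in Example 5.15. The MLE of $\mu$ is the sample median $M_e$, while the MEE of $\mu$ is the sample mean $\bar X$. The MEE is an unbiased estimator of $\mu$, with variance $V\{\bar X\} = 2\beta^2/n$. The median is also an unbiased estimator of $\mu$. Indeed, let $n = 2m + 1$; then $M_e \sim \mu + \beta Y_{(m+1)}$, where $Y_{(m+1)}$ is the $(m+1)$st order statistic of a sample of $n = 2m + 1$ i.i.d. random variables having a standard Laplace distribution ($\mu = 0$, $\beta = 1$). The p.d.f. of $Y_{(m+1)}$ is

$$g(y) = \frac{(2m+1)!}{(m!)^2}f(y)F^m(y)[1 - F(y)]^m, \quad -\infty < y < \infty,$$

where

$$f(y) = \frac{1}{2}\exp\{-|y|\}, \quad -\infty < y < \infty$$

and

$$F(y) = \begin{cases} \frac{1}{2}e^{y}, & \text{if } y < 0, \\ 1 - \frac{1}{2}e^{-y}, & \text{if } y \geq 0. \end{cases}$$

It is easy to verify that $g(-y) = g(y)$ for all $-\infty < y < \infty$. Hence, $E\{Y_{(m+1)}\} = 0$ and $E\{M_e\} = \mu$. The variance of $M_e$, for $m \geq 1$, is

$$V\{M_e\} = \beta^2V\{Y_{(m+1)}\} = \beta^2\frac{(2m+1)!}{2^m(m!)^2}\int_0^\infty y^2e^{-(m+1)y}\left(1 - \frac{1}{2}e^{-y}\right)^m dy$$

$$= \beta^2\frac{(2m+1)!}{2^m(m!)^2}\sum_{j=0}^m \left(-\frac{1}{2}\right)^j\binom{m}{j}\int_0^\infty y^2e^{-(m+j+1)y}\,dy$$

$$= \beta^2\frac{(2m+1)!}{2^{m-1}(m!)^2}\sum_{j=0}^m \left(-\frac{1}{2}\right)^j\binom{m}{j}\frac{1}{(1 + j + m)^3}.$$

The factor $2^m$ in the first line already accounts for both halves of the symmetric density, and the last line uses $\int_0^\infty y^2e^{-ay}dy = 2/a^3$. Thus, for $\beta = 1$, one obtains the following values for the variances of the estimators:

  Est.          m=1       m=2       m=3       m=10      m=20
  $M_e$         0.6389    0.3512    0.2356    0.0654    0.0308
  $\bar X_n$    0.6667    0.4000    0.2857    0.0952    0.0488

We see that the variance of $M_e$ is smaller than that of $\bar X_n$ at every sample size shown, and that the ratio of the two variances decreases as $m$ grows, from about 0.96 at $m = 1$ to about 0.63 at $m = 20$. As will be shown in Section 5.10, as $n \to \infty$, the ratio of the asymptotic variances approaches 1/2. It is also interesting to compare the expectations and MSE of the MLE and MEE of the scale parameter $\beta$.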
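The variance of the sample median can be evaluated exactly from the density $g(y)$ of $Y_{(m+1)}$. Integrating $y^2g(y)$ over both half-lines and expanding $(1 - \frac{1}{2}e^{-y})^m$ gives a finite sum, which the sketch below computes in rational arithmetic (for $\beta = 1$ and $m \geq 1$). The function name is illustrative; for $m = 1$ the result is $23/36 \approx 0.6389$, to be compared with $V\{\bar X_3\} = 2/3$.

```python
from fractions import Fraction
from math import comb, factorial

def median_variance(m):
    """Exact variance of the median of n = 2m + 1 standard Laplace variables (m >= 1)."""
    c = Fraction(factorial(2 * m + 1), factorial(m) ** 2)
    s = sum(Fraction(-1, 2) ** j * comb(m, j) / Fraction((m + 1 + j) ** 3)
            for j in range(m + 1))
    # 2 * int_0^inf y^2 g(y) dy, with int_0^inf y^2 exp(-a y) dy = 2 / a^3
    return c * s / Fraction(2) ** (m - 1)

for m in (1, 2, 3, 10, 20):
    n = 2 * m + 1
    print(m, float(median_variance(m)), 2 / n)   # median variance vs. variance of the mean
```

Exact fractions are used because the alternating sum suffers heavy cancellation in floating point as $m$ grows.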


Example 5.28. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common log-normal distribution $LN(\mu, \sigma^2)$, $-\infty < \mu < \infty$, and $0 < \sigma^2 < \infty$. Let $Y_i = \log X_i$, $i = 1, \ldots, n$; $\bar Y_n = \frac{1}{n}\sum_{i=1}^n Y_i$ and $\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y_n)^2$. $\bar Y_n$ and $\hat\sigma_n^2$ are the MLEs of $\mu$ and $\sigma^2$, respectively. We derive now the MEEs of $\mu$ and $\sigma^2$. The first two moments of $LN(\mu, \sigma^2)$ are

$$\mu_1 = \exp\{\mu + \sigma^2/2\}, \qquad \mu_2 = \exp\{2\mu + 2\sigma^2\}.$$

Accordingly, the MEEs of $\mu$ and $\sigma^2$ are

$$\tilde\mu = 2\log M_1 - \frac{1}{2}\log M_2, \qquad \tilde\sigma^2 = \log M_2 - 2\log M_1,$$

where $M_1 = \bar X_n$ and $M_2 = \frac{1}{n}\sum_{i=1}^n X_i^2$ are the sample moments. Note that the MEEs $\tilde\mu$ and $\tilde\sigma^2$ are not functions of the minimal sufficient statistics $\bar Y_n$ and $\hat\sigma_n^2$, and therefore are expected to have larger MSEs than those of the MLEs.
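The two pairs of estimators can be computed side by side. The following is a minimal sketch with made-up data; the function name and the sample values are illustrative and not taken from the text.

```python
import math

def lognormal_estimates(xs):
    """Return ((mu, sigma2) by maximum likelihood, (mu, sigma2) by the method of moments)."""
    n = len(xs)
    ys = [math.log(x) for x in xs]
    y_bar = sum(ys) / n
    mle = (y_bar, sum((y - y_bar) ** 2 for y in ys) / n)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    mee = (2 * math.log(m1) - 0.5 * math.log(m2),
           math.log(m2) - 2 * math.log(m1))
    return mle, mee

sample = [0.8, 1.9, 1.2, 3.5, 0.6, 2.4, 1.1, 0.9]   # hypothetical positive data
mle, mee = lognormal_estimates(sample)
print(mle, mee)
```

By construction the moment estimates reproduce the first two sample moments exactly, that is, $\exp\{\tilde\mu + \tilde\sigma^2/2\} = M_1$ and $\exp\{2\tilde\mu + 2\tilde\sigma^2\} = M_2$, which is a convenient check on an implementation.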


Example 5.29. In Example 5.20, we discussed the problem of determining the values of the MLEs of the parameters $\lambda$ and $\beta$ of the Weibull distribution, where $X_1, \ldots, X_n$ are i.i.d. like $G^{1/\beta}(\lambda, 1)$ where $0 < \beta, \lambda < \infty$. The MEEs are obtained in the following manner. According to Table 2.2, the first two moments of $G^{1/\beta}(\lambda, 1)$ are

$$\mu_1 = \Gamma(1 + 1/\beta)/\lambda^{1/\beta}, \qquad \mu_2 = \Gamma(1 + 2/\beta)/\lambda^{2/\beta}.$$

Thus, we set the moment equations

$$M_1 = \Gamma(1 + 1/\hat\beta)/\hat\lambda^{1/\hat\beta}, \qquad M_2 = \Gamma(1 + 2/\hat\beta)/\hat\lambda^{2/\hat\beta}.$$

Accordingly, the MEE $\hat\beta$ is the root of the equation

$$\frac{1}{2\hat\beta}B\left(\frac{1}{\hat\beta}, \frac{1}{\hat\beta}\right) = \frac{M_1^2}{M_2}.$$

The solution of this equation can be obtained numerically. After solving for $\hat\beta$, one obtains $\hat\lambda$ as follows:

$$\hat\lambda = \left(\frac{\Gamma(1 + 1/\hat\beta)}{M_1}\right)^{\hat\beta}.$$

We illustrate this solution with the numbers in the sample of Example 5.14. In that sample, $n = 50$, $\sum_{i=1}^n x_i = 46.6897$, and $\sum_{i=1}^n x_i^2 = 50.9335$. Thus, $M_1 = 0.9338$ and $M_2 = 1.0187$. Equation (5.8.9) becomes

$$1.71195\,\hat\beta = B\left(\frac{1}{\hat\beta}, \frac{1}{\hat\beta}\right).$$

The solution should be in the neighborhood of $\hat\beta = 2$, since $2 \times 1.71195 = 3.4239$ and $B\left(\frac{1}{2}, \frac{1}{2}\right) = \pi = 3.14159\ldots$. In the following table, we approximate the solution:

  $\beta$    $\Gamma(1/\beta)$    $B(1/\beta, 1/\beta)$    $1.71195\beta$
  2.5        2.21815              4.22613                  4.2798
  2.6        2.30936              4.44082                  4.4510
  2.7        2.40354              4.66284                  4.6222

Accordingly, the MEE of $\beta$ is approximately $\hat\beta = 2.67$ and that of $\lambda$ is approximately $\hat\lambda = 0.877$. The values of the MLE of $\beta$ and $\lambda$, obtained in Example 5.20, are 1.875 and 0.839, respectively. The MLEs are closer to the true values, but are more difficult to obtain.
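The moment equation for $\hat\beta$ can also be solved by bisection rather than by tabulation. The sketch below uses the sample sums quoted in the example; the bracketing interval $[2, 3]$, the tolerance, and the function names are assumptions made here for illustration.

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b) via the gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def weibull_mee(m1, m2, lo=2.0, hi=3.0, tol=1e-10):
    """Solve B(1/b, 1/b) = 2 b m1^2 / m2 for b by bisection, then compute lambda."""
    h = lambda b: beta_fn(1 / b, 1 / b) - 2 * b * m1 ** 2 / m2
    if h(lo) * h(hi) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0:
            hi = mid
        else:
            lo = mid
    b = 0.5 * (lo + hi)
    lam = (math.gamma(1 + 1 / b) / m1) ** b
    return lam, b

lam_hat, beta_hat = weibull_mee(46.6897 / 50, 50.9335 / 50)
print(lam_hat, beta_hat)
```

The tabulated values show the sign change of $B(1/\beta, 1/\beta) - 1.71195\beta$ between $\beta = 2.6$ and $\beta = 2.7$, so the bisection root must fall in that interval.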
Example 5.30.
A. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. random vectors having a bivariate normal distribution $N(\mathbf{0}, R)$, where $R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, $-1 < \rho < 1$. Accordingly, an estimator of $\rho$ is the sample mixed moment $M_{11} = \frac{1}{n}\sum_{i=1}^n X_iY_i$. This is also an unbiased estimator of $\rho$. There is no UMVU estimator of $\rho$, since the family of all such distributions is incomplete.

The likelihood function of $\rho$ is

$$L(\rho; \mathbf{X}, \mathbf{Y}) = \frac{1}{(1 - \rho^2)^{n/2}}\exp\left\{-\frac{1}{2(1 - \rho^2)}[Q_X + Q_Y - 2\rho P_{XY}]\right\},$$

where $Q_X = \sum X_i^2$, $Q_Y = \sum Y_i^2$ and $P_{XY} = \sum X_iY_i$. Note that the m.s.s. is $T = (Q_X + Q_Y, P_{XY})$. The maximal likelihood estimator of $\rho$ is a real solution of the cubic equation

$$n\rho^3 - \rho^2P_{XY} + (S - n)\rho - P_{XY} = 0,$$

where $S = Q_X + Q_Y$. In the present example, the MEE is a very simple estimator. There are many different unbiased estimators of $\rho$. The MEE is one such unbiased estimator. Another one is

$$\tilde\rho = 1 - \frac{1}{2n}\sum_{i=1}^n (X_i - Y_i)^2.$$

B. Consider the model of Example 5.10. The likelihood function is

$$L(\theta, \gamma, \sigma^2 \mid \bar X, \bar Y, Q) = \frac{1}{(\sigma^2)^n}\exp\left\{-\frac{n}{2\sigma^2}\left[(\bar X - \theta)^2 + (\bar Y - \gamma\theta^2)^2 + \frac{Q}{n}\right]\right\},$$

$-\infty < \theta, \gamma < \infty$, $0 < \sigma^2 < \infty$. The MEE of $\sigma^2$ is $\hat\sigma^2 = \frac{Q}{2n}$. Similarly, we find that the MEEs of $\theta$ and $\gamma$ are

$$\hat\theta = \bar X, \qquad \hat\gamma = \bar Y/\bar X^2.$$

The MLEs are the same.
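Example 5.31. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common $N(\mu, \sigma^2)$ distribution. The problem is to estimate the variance $\sigma^2$. If $\mu = 0$ then the minimum MSE equivariant estimator of $\sigma^2$ is $\hat\sigma_0^2 = \frac{1}{n+2}\sum_{i=1}^n X_i^2$. On the other hand, if $\mu$ is unknown the minimum MSE equivariant estimator of $\sigma^2$ is $\hat\sigma_1^2 = \frac{1}{n+1}\sum_{i=1}^n (X_i - \bar X_n)^2$, where $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$. One could suggest to test first the hypothesis $H_0 : \mu = 0$, $\sigma$ arbitrary, against $H_1 : \mu \neq 0$, $\sigma$ arbitrary, at some level of significance $\alpha$. If $H_0$ is accepted the estimator is $\hat\sigma_0^2$; otherwise, it is $\hat\sigma_1^2$. Suppose that the preliminary test is the t-test. Thus, the estimator of $\sigma^2$ assumes the form:

$$\hat\sigma^2 = \hat\sigma_0^2\,I\left\{\bar X_n, S^2 : \frac{\sqrt{n}|\bar X_n|}{S} \leq t_{1-\alpha/2}[n-1]\right\} + \hat\sigma_1^2\,I\left\{\bar X_n, S^2 : \frac{\sqrt{n}|\bar X_n|}{S} > t_{1-\alpha/2}[n-1]\right\},$$

where $S^2$ is the sample variance. Note that this PTE is not translation invariant, since neither the t-test of $H_0$ is translation invariant nor $\hat\sigma_0^2$. The estimator $\hat\sigma^2$ may have smaller MSE values than those of $\hat\sigma_0^2$ or of $\hat\sigma_1^2$ on some intervals of $(\mu, \sigma^2)$ values. Actually, $\hat\sigma^2$ has smaller MSE than that of $\hat\sigma_1^2$ for all $(\mu, \sigma^2)$ if $t_{1-\alpha/2}[n-1] = \sqrt{\frac{n-1}{n+1}}$. This corresponds to (when $n$ is large) a value of $\alpha$ approximately equal to $\alpha = 0.3173$.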


Example 5.32. Let $X_1, \ldots, X_n$ be a sample of i.i.d. random variables from $N(\mu, \sigma_1^2)$ and let $Y_1, \ldots, Y_n$ be a sample of i.i.d. random variables from $N(\mu, \sigma_2^2)$. The $X$ and $Y$ vectors are independent. The problem is to estimate the common mean $\mu$. In Example 5.25, we studied the MSE of equivariant estimators of the common mean $\mu$. In Chapter 8, we will discuss the problem of determining an optimal equivariant estimator of $\mu$ in a Bayesian framework. We present here a PTE of $\mu$. Let $\rho = \sigma_2^2/\sigma_1^2$.

If $\rho = 1$ then the UMVU estimator of $\mu$ is $\hat\mu_1 = (\bar X + \bar Y)/2$, where $\bar X$ and $\bar Y$ are the sample means. When $\rho$ is unknown then a reasonably good unbiased estimator of $\mu$ is $\hat\mu_R = (\bar XR + \bar Y)/(R + 1)$, where $R = S_Y^2/S_X^2$ is the ratio of the sample variances $S_Y^2$ to $S_X^2$. A PTE of $\mu$ can be based on a preliminary test of $H_0 : \rho = 1$, $\mu$, $\sigma_1$, $\sigma_2$ arbitrary against $H_1 : \rho \neq 1$, $\mu$, $\sigma_1$, $\sigma_2$ arbitrary. If we apply the F-test, we obtain the PTE

$$\hat\mu = \hat\mu_1I\{R \leq F_{1-\alpha}[n-1, n-1]\} + \hat\mu_RI\{R > F_{1-\alpha}[n-1, n-1]\}.$$

This estimator is unbiased, since $\bar X$ and $\bar Y$ are independent of $R$. Furthermore,

$$V\{\hat\mu \mid R\} = \begin{cases} \dfrac{\sigma_1^2}{n}\cdot\dfrac{1 + \rho}{4}, & \text{if } R \leq F_{1-\alpha}[n-1, n-1], \\[2ex] \dfrac{\sigma_1^2}{n}\cdot\dfrac{\rho + R^2}{(1 + R)^2}, & \text{if } R > F_{1-\alpha}[n-1, n-1]. \end{cases}$$

Hence, since $E\{\hat\mu \mid R\} = \mu$ for all $R$, we obtain from the law of total variance that the variance of the PTE is

$$V\{\hat\mu\} = \frac{\sigma_1^2}{n}\left(\frac{1 + \rho}{4}P\left\{F[n-1, n-1] \leq \frac{1}{\rho}F_{1-\alpha}[n-1, n-1]\right\} + \int_{R^*}^\infty \frac{\rho + R^2}{(1 + R)^2}f_\rho(R)\,dR\right),$$

where $R^* = F_{1-\alpha}[n-1, n-1]$, and $f_\rho(R)$ is the p.d.f. of $\rho F[n-1, n-1]$ at $R$. Closed formulae in cases of small $n$ were given by Zacks (1966).

Section 5.2
5.2.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular distribution $R(\theta_1, \theta_2)$, $-\infty < \theta_1 < \theta_2 < \infty$.
(i) Determine the UMVU estimators of $\theta_1$ and $\theta_2$.
(ii) Determine the covariance matrix of these UMVU estimators.

5.2.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables having an exponential distribution, $E(\lambda)$, $0 < \lambda < \infty$.
(i) Derive the UMVU estimator of $\lambda$ and its variance.
(ii) Show that the UMVU estimator of $\rho = e^{-\lambda}$ is

$$\hat\rho = \left(\left(1 - \frac{1}{T}\right)^+\right)^{n-1},$$

where $T = \sum_{i=1}^n X_i$ and $a^+ = \max(a, 0)$.
(iii) Prove that the variance of $\hat\rho$ is

$$V\{\hat\rho\} = \frac{1}{\Gamma(n)}\left(\sum_{i=0}^{n-1}(-\lambda)^i\binom{2n-2}{i}(n - i - 1)!\,P(n - i - 1; \lambda) + \sum_{i=n}^{2n-2}(-\lambda)^i\binom{2n-2}{i}H(i - n + 1 \mid \lambda)\right) - e^{-2\lambda},$$

where $P(j; \lambda)$ is the c.d.f. of $P(\lambda)$ and $H(k \mid x) = \int_x^\infty e^{-u}/u^k\,du$. [$H(k \mid x)$ can be determined recursively by the relation

$$H(k \mid x) = \frac{1}{k - 1}\left(\frac{e^{-x}}{x^{k-1}} - H(k - 1 \mid x)\right), \quad k \geq 2,$$

and $H(1 \mid x)$ is the exponential integral (Abramowitz and Stegun, 1968).]


5.2.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a two-parameter exponential
distribution, $X_i \sim \mu + G(\lambda, 1)$. Derive the UMVU estimators of $\mu$ and
$\lambda$ and their covariance matrix.


5.2.4 Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, 1)$ random variables.
(i) Find a $\lambda(n)$ such that $\Phi(\lambda(n)\bar X)$ is the UMVU estimator of $\Phi(\mu)$.
(ii) Derive the variance of this UMVU estimator.


5.2.5 Consider Example 5.4. Find the variances of the UMVU estimators of $p(0; \lambda)$
and of $p(1; \lambda)$. [Hint: Use the formula of the p.g.f. of a $P(n\lambda)$.]


5.2.6 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $NB(\psi, \nu)$ distribution;
$0 < \psi < 1$ ($\nu$ known). Prove that the UMVU estimator of $\psi$ is

$$\hat\psi = \frac{T}{n\nu + T - 1}, \quad \text{where } T = \sum_{i=1}^{n} X_i.$$

5.2.7 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a binomial distribution
$B(N, \theta)$, $0 < \theta < 1$.

(i) Derive the UMVU estimator of $\theta$ and its variance.
(ii) Derive the UMVU estimator of $\sigma^2(\theta) = \theta(1 - \theta)$ and its variance.
(iii) Derive the UMVU estimator of $b(j; N, \theta)$.


5.2.8 Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, 1)$ random variables. Find a constant $b(n)$ so
that

$$\hat f(\xi; \bar X) = \frac{1}{\sqrt{2\pi b(n)}} \exp\left\{-\frac{1}{2b(n)}(\xi - \bar X)^2\right\}$$

is a UMVU estimator of the p.d.f. of $X$ at $\xi$, i.e., $\frac{1}{\sqrt{2\pi}} \exp\{-\frac{1}{2}(\xi - \mu)^2\}$.
[Hint: Apply the m.g.f. of $(\bar X - \xi)^2$.]


5.2.9 Let $J_1, \ldots, J_n$ be i.i.d. random variables having a binomial distribution
$B(1, e^{-\Delta/\theta})$, $0 < \theta < 1$ ($\Delta$ known). Let $\hat p_n = \left(\sum_{i=1}^{n} J_i + \frac{1}{2}\right)/(n + 1)$. Consider
the estimator of $\theta$

$$\hat\theta_n = -\Delta/\log(\hat p_n).$$

Determine the bias of $\hat\theta_n$ as a power-series in $1/n$.


5.2.10 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a binomial distribution
$B(N, \theta)$, $0 < \theta < 1$. What is the Cramér-Rao lower bound to the variance
of the UMVU estimator of $\omega = \theta(1 - \theta)$?


5.2.11 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a negative-binomial
distribution $NB(\psi, \nu)$. What is the Cramér-Rao lower bound to the variance of
the UMVU estimator of $\psi$? [See Problem 6.]


5.2.12 Derive the Cramér-Rao lower bound to the variance of the UMVU estimator
of $\rho = e^{-\lambda}$ in Problem 2.


5.2.13 Derive the Cramér-Rao lower bound to the variance of the UMVU estimator
of $\Phi(\mu)$ in Problem 4.


5.2.14 Derive the BLBs of the second and third order for the UMVU estimator of
$\Phi(\mu)$ in Problem 4.


5.2.15 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common $N(\mu, \sigma^2)$
distribution, $-\infty < \mu < \infty$, $0 < \sigma^2 < \infty$.
(i) Show that $\hat\omega = \exp\{\bar X\}$ is the UMVU estimator of $\omega = \exp\{\mu + \sigma^2/2n\}$.
(ii) What is the variance of $\hat\omega$?
(iii) Show that the Cramér-Rao lower bound for the variance of $\hat\omega$ is

$$\frac{\sigma^2}{n}\, e^{2\mu + \sigma^2/n} \left(1 + \frac{\sigma^2}{2n^2}\right).$$


5.2.16 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common $N(\mu, \sigma^2)$
distribution, $-\infty < \mu < \infty$, $0 < \sigma < \infty$. Determine the Cramér-Rao lower
bound for the variance of the UMVU estimator of $\omega = \mu + z_\gamma \sigma$, where
$z_\gamma = \Phi^{-1}(\gamma)$, $0 < \gamma < 1$.


5.2.17 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $G(\lambda, \nu)$ distribution,
$0 < \lambda < \infty$, $\nu > 3$ fixed.
(i) Determine the UMVU estimator of $\lambda^2$.
(ii) Determine the variance of this UMVU.
(iii) What is the Cramér-Rao lower bound for the variance of the UMVU
estimator?
(iv) Derive the BLBs of orders 2, 3, and 4.
5.2.18 Consider Example 5.8. Show that the Cramér-Rao lower bound for the
variance of the MVU estimator of $\mathrm{cov}(X, Y) = \rho\sigma^2$ is $\frac{\sigma^4}{n}(1 + \rho^2)$.


5.2.19 Let $X_1, \ldots, X_n$ be i.i.d. random variables from $N(\mu_1, \sigma_1^2)$ and $Y_1, \ldots, Y_n$
i.i.d. from $N(\mu_2, \sigma_2^2)$. The random vectors X and Y are independent and
n> 3.Letd= eed ere 

(i) What is the UMVU estimator of $\delta$ and what is its variance?
(ii) Derive the Cramér-Rao lower bound to the variance of the UMVU
estimator of $\delta$.


5.2.20 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a rectangular distribution
$R(0, \theta)$, $0 < \theta < \infty$. Derive the Chapman-Robbins inequality for the
UMVU of $\theta$.


5.2.21 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a Laplace distribution
$L(\mu, \sigma)$, $-\infty < \mu < \infty$, $0 < \sigma < \infty$. Derive the Chapman-Robbins
inequality for the variances of unbiased estimators of $\mu$.


Section 5.3 


5.3.1 Show that if $\hat\theta(X)$ is a biased estimator of $\theta$, having a differentiable bias
function $B(\theta)$, then the efficiency of $\hat\theta(X)$, when the regularity conditions
hold, is

$$E_\theta(\hat\theta) = \frac{(1 + B'(\theta))^2}{I_n(\theta)\, V_\theta\{\hat\theta\}}.$$

5.3.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a negative exponential
distribution $G(\lambda, 1)$, $0 < \lambda < \infty$.
(i) Derive the efficiency function $E(\lambda)$ of the UMVU estimator of $\lambda$.
(ii) Derive the efficiency function of the MLE of $\lambda$.


5.3.3 Consider Example 5.8.
(i) What are the efficiency functions of the unbiased estimators of $\sigma^2$ and
$\rho$, where $\hat\rho = \sum X_i Y_i / \sum X_i^2$ and $\hat\sigma^2 = \frac{1}{2n} \sum_{i=1}^{n} (X_i^2 + Y_i^2)$?
(ii) What is the combined efficiency function (5.3.13) for the two estimators
simultaneously?
Section 5.4


5.4.1 Let $X_1, \ldots, X_n$ be equicorrelated random variables having a common
unknown mean $\mu$. The variance of each variable is $\sigma^2$ and the correlation
between any two variables is $\rho = 0.7$.

(i) Show that the covariance matrix of $X = (X_1, \ldots, X_n)'$ is
$\Sigma = \sigma^2(0.3 I_n + 0.7 J_n) = 0.3\sigma^2\left(I_n + \frac{7}{3} J_n\right)$, where $I_n$ is the identity matrix
of order $n$ and $J_n$ is an $n \times n$ matrix of 1s.
(ii) Determine the BLUE of $\mu$.
(iii) What is the variance of the BLUE of $\mu$?
(iv) How would you estimate $\sigma^2$?


5.4.2 Let $X_1, X_2, X_3$ be i.i.d. random variables from a rectangular distribution
$R(\mu - \sigma, \mu + \sigma)$, $-\infty < \mu < \infty$, $0 < \sigma < \infty$. What is the best linear
combination of the order statistics $X_{(i)}$, $i = 1, 2, 3$, for estimating $\mu$, and what
is its variance?


5.4.3 Suppose that $X_1, \ldots, X_n$ are i.i.d. from a Laplace distribution with p.d.f.
$f(x; \mu, \sigma) = \frac{1}{\sigma} \psi\left(\frac{x - \mu}{\sigma}\right)$, $-\infty < x < \infty$, where $\psi(z) = \frac{1}{2} e^{-|z|}$; $-\infty < \mu < \infty$,
$0 < \sigma < \infty$. What is the best linear unbiased combination of $X_{(1)}$,
$M_e$, and $X_{(n)}$ for estimating $\mu$, when $n = 5$?


5.4.4 Let $W_p(T) = \sum_{t=1}^{T} t^p$.

(i) Show that

$$\sum_{k=0}^{p} \binom{p+1}{k} W_k(T) = (T+1)^{p+1} - 1.$$

(ii) Apply (i) to derive the following formulae:

$$\sum_{t=1}^{T} t = \frac{1}{2} T(T+1),$$
$$\sum_{t=1}^{T} t^2 = \frac{1}{6} T(T+1)(2T+1),$$
$$\sum_{t=1}^{T} t^3 = \frac{1}{4} T^2 (T+1)^2,$$
$$\sum_{t=1}^{T} t^4 = \frac{1}{30} T(T+1)(2T+1)(3T^2 + 3T - 1),$$
$$\sum_{t=1}^{T} t^5 = \frac{1}{12} T^2 (T+1)^2 (2T^2 + 2T - 1),$$
$$\sum_{t=1}^{T} t^6 = \frac{1}{42} T(T+1)(2T+1)(3T^4 + 6T^3 - 3T + 1).$$

[Hint: To prove (i), show that both sides are $\sum_{t=1}^{T} (t+1)^{p+1} - \sum_{t=1}^{T} t^{p+1}$
(Anderson, 1971, p. 83).]
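
The identity in (i) and the closed forms in (ii) can be checked numerically before attempting the proof. The following is a minimal sketch, not part of the original problem; the function names `power_sum`, `power_sum_from_identity`, and the table `CLOSED_FORMS` are illustrative choices.

```python
from math import comb


def power_sum(p, T):
    """W_p(T) = sum of t**p for t = 1..T, by direct summation."""
    return sum(t ** p for t in range(1, T + 1))


def power_sum_from_identity(p, T):
    """W_p(T) obtained by solving identity (i) for its k = p term.

    Identity (i): sum_{k=0}^{p} C(p+1, k) W_k(T) = (T+1)^(p+1) - 1,
    and C(p+1, p) = p + 1, so W_p follows from W_0, ..., W_{p-1}.
    """
    lower = sum(comb(p + 1, k) * power_sum_from_identity(k, T) for k in range(p))
    return ((T + 1) ** (p + 1) - 1 - lower) // (p + 1)


# Closed forms listed in part (ii); integer division is exact here.
CLOSED_FORMS = {
    1: lambda T: T * (T + 1) // 2,
    2: lambda T: T * (T + 1) * (2 * T + 1) // 6,
    3: lambda T: T ** 2 * (T + 1) ** 2 // 4,
    4: lambda T: T * (T + 1) * (2 * T + 1) * (3 * T ** 2 + 3 * T - 1) // 30,
    5: lambda T: T ** 2 * (T + 1) ** 2 * (2 * T ** 2 + 2 * T - 1) // 12,
    6: lambda T: T * (T + 1) * (2 * T + 1) * (3 * T ** 4 + 6 * T ** 3 - 3 * T + 1) // 42,
}

if __name__ == "__main__":
    for T in (1, 2, 5, 10):
        for p, formula in CLOSED_FORMS.items():
            assert formula(T) == power_sum(p, T) == power_sum_from_identity(p, T)
    print("identity (i) and formulae (ii) agree for the values tested")
```

A numerical check of this kind only guards against transcription slips in the formulae; it does not replace the telescoping argument given in the hint.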


5.4.5 Let $X_t = f(t) + e_t$, where $t = 1, \ldots, T$, where

$$f(t) = \sum_{i=0}^{p} \beta_i t^i, \quad t = 1, \ldots, T;$$

$e_t$ are uncorrelated random variables, with $E\{e_t\} = 0$, $V\{e_t\} = \sigma^2$ for all
$t = 1, \ldots, T$.
(i) Write the normal equations for the least-squares estimation of the polynomial
coefficients $\beta_i$ ($i = 0, \ldots, p$).
(ii) Develop explicit formula for the coefficients $\beta_i$ in the case of $p = 2$.
(iii) Develop explicit formula for $V\{\hat\beta_i\}$ and $\sigma^2$ for the case of $p = 2$. [The
above results can be applied for a polynomial trend fitting in time series
analysis when the errors are uncorrelated.]


5.4.6 The annual consumption of meat per capita in the United States during the
years 1919-1941 (in pounds) is (Anderson, 1971, p. 44)

t      19     20     21     22     23     24     25     26     27
X_t    171.5  167.0  164.5  169.3  179.4  179.2  172.6  170.5  168.6

t      28     29     30     31     32     33     34     35     36
X_t    164.7  163.6  162.1  160.2  161.2  165.8  163.5  146.7  160.2

t      37     38     39     40     41
X_t    156.8  156.8  165.4  174.7  178.7

(i) Fit a cubic trend to the data by the method of least squares.
(ii) Estimate the error variance $\sigma^2$ and test the significance of the polynomial
coefficients, assuming that the errors are i.i.d. $N(0, \sigma^2)$.
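5.4.7 Let $(x_{1i}, Y_{1i})$, $i = 1, \ldots, n_1$, and $(x_{2i}, Y_{2i})$, $i = 1, \ldots, n_2$, be two independent
sets of regression points. It is assumed that

$$Y_{ji} = \beta_{0j} + \beta_1 x_{ji} + e_{ji}, \quad j = 1, 2; \; i = 1, \ldots, n_j,$$

where $x_{ji}$ are constants and $e_{ji}$ are i.i.d. $N(0, \sigma^2)$. Let

$$SDX_j = \sum_{i=1}^{n_j} (x_{ji} - \bar x_j)^2,$$
$$SPD_j = \sum_{i=1}^{n_j} (x_{ji} - \bar x_j)(Y_{ji} - \bar Y_j), \quad j = 1, 2,$$
$$SDY_j = \sum_{i=1}^{n_j} (Y_{ji} - \bar Y_j)^2,$$

where $\bar x_j$ and $\bar Y_j$ are the respective sample means.
(i) Show that the LSE of $\beta_1$ is

$$\hat\beta_1 = \frac{\sum_{j=1}^{2} SPD_j}{\sum_{j=1}^{2} SDX_j},$$

and that the LSEs of $\beta_{0j}$ ($j = 1, 2$) are

$$\hat\beta_{0j} = \bar Y_j - \hat\beta_1 \bar x_j.$$

(ii) Show that an unbiased estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{N - 3}\left\{\sum_{j=1}^{2} SDY_j - \hat\beta_1 \sum_{j=1}^{2} SPD_j\right\},$$

where $N = n_1 + n_2$.
(iii) Show that

$$V\{\hat\beta_1\} = \sigma^2 \Big/ \sum_{j=1}^{2} SDX_j; \qquad V\{\hat\beta_{0j}\} = \frac{\sigma^2}{n_j}\left\{1 + \frac{n_j \bar x_j^2}{\sum_{j=1}^{2} SDX_j}\right\}.$$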
Section 5.5


5.5.1 Consider the following raw data (Draper and Smith, 1966, p. 178).

Y     78.5  74.3  104.3  87.6  95.9  109.2  102.7  72.5  93.1  115.9  83.8  113.3  109.4
X_1   7     1     11     11    7     11     3      1     2     21     1     11     10
X_2   26    29    56     31    52    55     71     31    54    47     40    66     68
X_3   6     15    8      8     6     9      17     22    18    4      23    9      8
X_4   60    52    20     47    33    22     6      44    22    26     34    12     12

(i) Determine the LSE of $\beta_0, \ldots, \beta_4$ and of $\sigma^2$ in the model $Y = \beta_0 +
\beta_1 X_1 + \ldots + \beta_4 X_4 + e$, where $e \sim N(0, \sigma^2)$.
(ii) Determine the ridge-regression estimates $\hat\beta_i(k)$, $i = 0, \ldots, 4$, for $k =
0.1, 0.2, 0.3$.
(iii) What value of $k$ would you use?


Section 5.6 


5.6.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a binomial distribution
$B(1, \theta)$, $0 < \theta < 1$. Find the MLE of

(i) $\sigma^2 = \theta(1 - \theta)$;
(ii) $\rho = e^{-\theta}$;
(iii) $\omega = e^{-\theta}/(1 + e^{-\theta})$;
(iv) $\phi = \log(1 + \theta)$.


5.6.2 Let $X_1, \ldots, X_n$ be i.i.d. $P(\lambda)$, $0 < \lambda < \infty$. What is the MLE of $p(j; \lambda) =
e^{-\lambda} \lambda^j / j!$, $j = 0, 1, \ldots$?


5.6.3 Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, $-\infty < \mu < \infty$, $0 < \sigma < \infty$. Determine
the MLEs of

(i) $\mu + z_\gamma \sigma$, where $z_\gamma = \Phi^{-1}(\gamma)$, $0 < \gamma < 1$;
(ii) $\omega(\mu, \sigma) = \Phi(\mu/\sigma) \cdot [1 - \Phi(\mu/\sigma)]$.


5.6.4 Using the delta method (see Section 1.13.4), determine the large sample
approximation to the expectations and variances of the MLEs of Problems
1, 2, and 3.


5.6.5 Consider the normal regression model

$$Y_i = \beta_0 + \beta_1 x_i + e_i \quad (i = 1, \ldots, n),$$

where $x_1, \ldots, x_n$ are constants such that $\sum_{i=1}^{n} (x_i - \bar x)^2 > 0$, and $e_1, \ldots, e_n$
are i.i.d. $N(0, \sigma^2)$.
(i) Show that the MLEs of $\beta_0$ and $\beta_1$ coincide with the LSEs.
(ii) What is the MLE of $\sigma^2$?


5.6.6 Let $(x_i, T_i)$, $i = 1, \ldots, n$ be specified in the following manner. $x_1, \ldots, x_n$
are constants such that $\sum_{i=1}^{n} (x_i - \bar x)^2 > 0$, $\bar x = \frac{1}{n} \sum_{i=1}^{n} x_i$; $T_1, \ldots, T_n$ are
independent random variables and $T_i \sim G\left(\frac{1}{\alpha + \beta x_i}, 1\right)$, $i = 1, \ldots, n$.

(i) Determine the maximum likelihood equations for $\alpha$ and $\beta$.
(ii) Set up the Newton-Raphson iterative procedure for determining the
MLE, starting with the LSE of $\alpha$ and $\beta$ as initial solutions.


5.6.7 Consider the MLE of the parameters of the normal, logistic, and extreme-value
tolerance distributions (Section 5.6.6). Let $x_1 < \ldots < x_k$ be controlled
experimental levels, $n_1, \ldots, n_k$ the sample sizes and $J_1, \ldots, J_k$ the number
of response cases from those samples. Let $p_i = (J_i + 1/2)/(n_i + 1)$. The
following transformations:
1. Normit: $Y_i = \Phi^{-1}(p_i)$, $i = 1, \ldots, k$;
2. Logit: $Y_i = \log(p_i/(1 - p_i))$, $i = 1, \ldots, k$;
3. Extremit: $Y_i = -\log(-\log p_i)$, $i = 1, \ldots, k$;
are applied first to determine the initial solutions. For the normal, logistic,
and extreme-value models, determine the following:
(i) The LSEs of $\theta_1$ and $\theta_2$ based on the linear model $Y_i = \theta_1 + \theta_2 x_i + e_i$
($i = 1, \ldots, k$).
(ii) The MLE of $\theta_1$ and $\theta_2$, using the LSEs as initial solutions.
(iii) Apply (i) and (ii) to fit the normal, logistic, and extreme-value models
to the following set of data in which $k = 3$; $n_i = 50$ ($i = 1, 2, 3$);
$x_1 = -1$, $x_2 = 0$, $x_3 = 1$; $J_1 = 15$, $J_2 = 34$, $J_3 = 48$.

(iv) We could say that one of the above three models fits the data better
than the other two if the corresponding statistic

$$W^2 = \sum_{i=1}^{k} n_i p_i^2 / F(x_i; \hat\theta)$$

is minimal; or

$$D^2 = \sum_{i=1}^{k} n_i p_i \log F(x_i; \hat\theta)$$

is maximal. Determine $W^2$ and $D^2$ for each one of the above models,
according to the data in (iii), and infer which one of the three models
better fits the data.
5.6.8 Consider a trinomial according to which $(J_1, J_2)$ has the trinomial distribution
$M(n, P(\theta))$, where

$$P_1(\theta) = \theta^2, \quad P_2(\theta) = 2\theta(1 - \theta), \quad P_3(\theta) = (1 - \theta)^2.$$

This is the Hardy-Weinberg model.
(i) Show that the MLE of $\theta$ is

$$\hat\theta_n = \frac{2J_1 + J_2}{2n}.$$

(ii) Find the Fisher information function $I_n(\theta)$.
(iii) What is the efficiency in small samples $E(\hat\theta_n)$ of the MLE?


5.6.9 A minimum chi-squared estimator (MCE) of $\theta$ in a multinomial model
$M(n, P(\theta))$ is an estimator $\tilde\theta_n$ minimizing

$$X^2 = \sum_{i=1}^{k} (J_i - nP_i(\theta))^2 / nP_i(\theta).$$

For the model of Problem 8, show that the MCE of $\theta$ is the real root of the
equation

$$2(2J_1^2 - J_2^2 + 2J_3^2)\theta^3 - 3(4J_1^2 - J_2^2)\theta^2 + (12J_1^2 - J_2^2)\theta - 4J_1^2 = 0.$$


Section 5.7 


5.7.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common rectangular
distribution $R(0, \theta)$, $0 < \theta < \infty$.
(i) Show that this model is preserved under the group of transformations
of scale changing, i.e., $G = \{g_\beta : g_\beta X = \beta X, 0 < \beta < \infty\}$.
(ii) Show that the minimum MSE equivariant estimator of $\theta$ is $\frac{n+2}{n+1} X_{(n)}$.


5.7.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common location-parameter
Cauchy distribution, i.e., $f(x; \mu) = \frac{1}{\pi}(1 + (x - \mu)^2)^{-1}$, $-\infty <
x < \infty$; $-\infty < \mu < \infty$. Show that the Pitman estimator of $\mu$ is

$$\hat\mu = X_{(1)} - \left\{\int_{-\infty}^{\infty} u (1 + u^2)^{-1} \prod_{i=2}^{n} (1 + (Y_{(i)} + u)^2)^{-1} du\right\} \Big/ \left\{\int_{-\infty}^{\infty} (1 + u^2)^{-1} \prod_{i=2}^{n} (1 + (Y_{(i)} + u)^2)^{-1} du\right\},$$

where $Y_{(i)} = X_{(i)} - X_{(1)}$, $i = 2, \ldots, n$. Or, by making the transformation
$\omega = (1 + u^2)^{-1}$ one obtains the expression


A=Xay + 


Catal VEY] “ff Jo 


This estimator can be evaluated by numerical integration. 


5.7.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $N(\mu, \sigma^2)$ distribution.
Determine the Pitman estimators of $\mu$ and $\sigma$, respectively.


5.7.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a location and scale parameter
p.d.f. $f(x; \mu, \sigma) = \frac{1}{\sigma} \psi\left(\frac{x - \mu}{\sigma}\right)$, where $-\infty < \mu < \infty$, $0 < \sigma < \infty$
and $\psi(z)$ is of the form
(i) $\psi(z) = \frac{1}{2} \exp\{-|z|\}$, $-\infty < z < \infty$ (Laplace);
(ii) $\psi(z) = 6z(1 - z)$, $0 \le z \le 1$ ($\beta(2, 2)$).
Determine the Pitman estimators of $\mu$ and $\sigma$ for (i) and (ii).


Section 5.8 
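5.8.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables. What are the MEEs of the parameters
of

(i) $NB(\psi, \nu)$; $0 < \psi < 1$, $0 < \nu < \infty$;
(ii) $G(\lambda, \nu)$; $0 < \lambda < \infty$, $0 < \nu < \infty$;
(iii) $LN(\mu, \sigma^2)$; $-\infty < \mu < \infty$, $0 < \sigma^2 < \infty$;
(iv) $G^{1/\beta}(\lambda, 1)$; $0 < \lambda < \infty$, $0 < \beta < \infty$ (Weibull);
(v) Location and scale parameter distributions with p.d.f. $f(x; \mu, \sigma) =
\frac{1}{\sigma} \psi\left(\frac{x - \mu}{\sigma}\right)$; $-\infty < \mu < \infty$, $0 < \sigma < \infty$; with

(a) $\psi(z) = \frac{1}{2} \exp\{-|z|\}$ (Laplace),

(b) $\psi(z) = \frac{1}{\sqrt{\nu}\, B\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1 + \frac{z^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$, $\nu$ known,

(c) $\psi(z) = \frac{1}{B(\nu_1, \nu_2)} z^{\nu_1 - 1} (1 - z)^{\nu_2 - 1}$, $0 \le z \le 1$, $\nu_1$ and $\nu_2$ known.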
5.8.2 It is a common practice to express the degree of skewness and kurtosis
(peakness) of a p.d.f. by the coefficients

$$\beta_1 = \mu_3^2 / \mu_2^3 \quad \text{(skewness)}$$

and

$$\beta_2 = \mu_4 / \mu_2^2 \quad \text{(kurtosis)}.$$

Provide MEEs of $\sqrt{\beta_1}$ and $\beta_2$ based on samples of $n$ i.i.d. random variables
$X_1, \ldots, X_n$.


5.8.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common distribution
which is a mixture $\alpha G(\lambda, \nu_1) + (1 - \alpha) G(\lambda, \nu_2)$, $0 < \alpha < 1$, $0 < \lambda$,
$\nu_1, \nu_2 < \infty$. Construct the MEEs of $\alpha$, $\lambda$, $\nu_1$, and $\nu_2$.


5.8.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common truncated normal
distribution with p.d.f.

$$f(x; \mu, \sigma, \xi) = n(x \mid \mu, \sigma^2) \Big/ \left(1 - \Phi\left(\frac{\xi - \mu}{\sigma}\right)\right) \cdot I\{x \ge \xi\},$$

where $n(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right\}$, $-\infty < x < \infty$.

Determine the MEEs of $(\mu, \sigma, \xi)$.


Section 5.9 


5.9.1 Consider the fixed-effects two-way ANOVA (Section 4.6.2). Accordingly,
$X_{ijk}$, $i = 1, \ldots, r_1$; $j = 1, \ldots, r_2$; $k = 1, \ldots, n$, are independent normal random
variables, $N(\mu_{ij}, \sigma^2)$, where

$$\mu_{ij} = \mu + \tau_i^A + \tau_j^B + \tau_{ij}^{AB}, \quad i = 1, \ldots, r_1; \; j = 1, \ldots, r_2.$$

Construct PTEs of the interaction parameters $\tau_{ij}^{AB}$ and the main-effects $\tau_i^A$,
$\tau_j^B$ ($i = 1, \ldots, r_1$; $j = 1, \ldots, r_2$). [The estimation is preceded by a test of
significance. If the test indicates nonsignificant effects, the estimates are
zero; otherwise they are given by the value of the contrasts.]


5.9.2 Consider the linear model $Y = A\beta + e$, where $Y$ is an $N \times 1$ vector, $A$ is an
$N \times p$ matrix ($p < N$) and $\beta$ a $p \times 1$ vector. Suppose that rank$(A) = p$.
Let $\beta' = (\beta_{(1)}', \beta_{(2)}')$, where $\beta_{(1)}$ is a $k \times 1$ vector, $1 \le k < p$. Construct
the LSE PTE of $\beta_{(1)}$. What is the expectation and the covariance of this
estimator?
5.9.3 Let $S_1^2$ be the sample variance of $n_1$ i.i.d. random variables having a
$N(\mu_1, \sigma_1^2)$ distribution and $S_2^2$ the sample variance of $n_2$ i.i.d. random variables
having a $N(\mu_2, \sigma_2^2)$ distribution. Furthermore, $S_1^2$ and $S_2^2$ are independent.
Construct PTEs of $\sigma_1^2$ and $\sigma_2^2$. What are the expectations and variances
of these estimators? For which level of significance, $\alpha$, do these PTEs have a
smaller MSE than $S_1^2$ and $S_2^2$ separately?


Section 5.10 


5.10.1 What is the asymptotic distribution of the sample median $M_e$ when the i.i.d.
random variables have a distribution which is the mixture

$$0.9 N(\mu, \sigma^2) + 0.1 L(\mu, \sigma),$$

where $L(\mu, \sigma)$ designates the Laplace distribution with location parameter $\mu$ and
scale parameter $\sigma$?


5.10.2 Suppose that $X_{(1)} \le \ldots \le X_{(9)}$ is the order statistic of a random sample of
size $n = 9$ from a rectangular distribution $R(\mu - \sigma, \mu + \sigma)$. What is the
expectation and variance of

(i) the tri-mean estimator of $\mu$;
(ii) the Gastwirth estimator of $\mu$?


5.10.3 Simulate $N = 1000$ random samples of size $n = 20$ from the distribution of
$X \sim 10 + 5t[10]$. Estimate in each sample the location parameter $\mu = 10$
by $\bar X$, $M_e$, GL, $\hat\mu_{.10}$ and compare the means and MSEs of these estimators
over the 1000 samples.
5.2.1

(i) The m.s.s. is $(X_{(1)}, X_{(n)})$, where $X_{(1)} < \cdots < X_{(n)}$. The m.s.s. is complete,
and

$$X_{(1)} \sim \theta_1 + (\theta_2 - \theta_1) U_{(1)},$$
$$X_{(n)} \sim \theta_1 + (\theta_2 - \theta_1) U_{(n)},$$

where $U_{(1)} < \cdots < U_{(n)}$ are the order statistics of $n$ i.i.d. $R(0, 1)$ random
variables. The p.d.f. of $U_{(i)}$, $i = 1, \ldots, n$ is

$$f_{U_{(i)}}(u) = \frac{n!}{(i - 1)!(n - i)!} u^{i-1} (1 - u)^{n-i}.$$

Thus, $U_{(i)} \sim \mathrm{Beta}(i, n - i + 1)$. Accordingly

$$E(X_{(1)}) = \theta_1 + (\theta_2 - \theta_1) \frac{1}{n+1},$$
$$E(X_{(n)}) = \theta_1 + (\theta_2 - \theta_1) \frac{n}{n+1}.$$

Solving the equations

$$\hat\theta_1 + (\hat\theta_2 - \hat\theta_1) \frac{1}{n+1} = X_{(1)},$$
$$\hat\theta_1 + (\hat\theta_2 - \hat\theta_1) \frac{n}{n+1} = X_{(n)}$$

for $\hat\theta_1$ and $\hat\theta_2$, we get UMVU estimators

$$\hat\theta_1 = \frac{n X_{(1)} - X_{(n)}}{n - 1}$$

and

$$\hat\theta_2 = \frac{n X_{(n)} - X_{(1)}}{n - 1}.$$


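(ii) Since $V\{U_{(1)}\} = V\{U_{(n)}\} = \frac{n}{(n+1)^2(n+2)}$ and $\mathrm{cov}(U_{(1)}, U_{(n)}) = \frac{1}{(n+1)^2(n+2)}$,
the covariance matrix of $(\hat\theta_1, \hat\theta_2)$ is

$$\frac{(\theta_2 - \theta_1)^2}{(n-1)(n+1)(n+2)} \begin{pmatrix} n & -1 \\ -1 & n \end{pmatrix}.$$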
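
The estimators of part (i) are simple functions of the sample extremes, and their unbiasedness can be checked by simulation. The sketch below is illustrative only: the function names `umvu_rectangular` and `simulated_means`, the seed, and the simulation settings are assumptions, not part of the original solution.

```python
import random


def umvu_rectangular(sample):
    """UMVU estimates of (theta1, theta2) for an i.i.d. R(theta1, theta2) sample.

    Uses theta1_hat = (n X_(1) - X_(n)) / (n - 1) and
         theta2_hat = (n X_(n) - X_(1)) / (n - 1).
    """
    n = len(sample)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_min, x_max = min(sample), max(sample)
    theta1_hat = (n * x_min - x_max) / (n - 1)
    theta2_hat = (n * x_max - x_min) / (n - 1)
    return theta1_hat, theta2_hat


def simulated_means(theta1, theta2, n, reps, seed=0):
    """Average the two estimates over `reps` simulated samples of size n."""
    rng = random.Random(seed)
    total1 = total2 = 0.0
    for _ in range(reps):
        sample = [rng.uniform(theta1, theta2) for _ in range(n)]
        est1, est2 = umvu_rectangular(sample)
        total1 += est1
        total2 += est2
    return total1 / reps, total2 / reps


if __name__ == "__main__":
    print(umvu_rectangular([2.0, 3.0, 5.0]))
    print(simulated_means(1.0, 4.0, n=5, reps=20000))
```

With $\theta_1 = 1$, $\theta_2 = 4$, the simulated means should fall close to 1 and 4, up to Monte Carlo error.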


” 1 ; 
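5.2.3 The m.s.s. is $nX_{(1)}$ and $U = \sum_{i=2}^{n} (X_{(i)} - X_{(1)})$, where $U \sim \frac{1}{\lambda} G(1, n - 1)$. Let
$\hat\lambda = \frac{n - 2}{U}$. Then

$$E(\hat\lambda) = (n - 2) \frac{\lambda^{n-1}}{\Gamma(n - 1)} \int_0^\infty x^{n-3} e^{-\lambda x} dx = \lambda.$$

Thus $\hat\lambda$ is UMVU of $\lambda$, provided $n > 2$. $X_{(1)} \sim \mu + G(n\lambda, 1)$.
Thus, $E\left\{X_{(1)} - \frac{U}{n(n - 1)}\right\} = \mu$, since $E\{U\} = \frac{n - 1}{\lambda}$. Thus, $\hat\mu = X_{(1)} - \frac{U}{n(n - 1)}$
is a UMVU.

$$E\left\{\frac{1}{U^2}\right\} = \frac{\lambda^{n-1}}{\Gamma(n - 1)} \int_0^\infty x^{n-4} e^{-\lambda x} dx = \frac{\lambda^2}{(n - 2)(n - 3)}.$$

Thus,

$$V\{\hat\lambda\} = \lambda^2 \left(\frac{n - 2}{n - 3} - 1\right) = \frac{\lambda^2}{n - 3},$$

provided $n > 3$. Since $X_{(1)}$ and $U$ are independent,

$$V\{\hat\mu\} = V\left\{X_{(1)} - \frac{U}{n(n - 1)}\right\} = V\{X_{(1)}\} + \frac{V\{U\}}{n^2 (n - 1)^2}$$
$$= \frac{1}{n^2 \lambda^2} + \frac{n - 1}{n^2 (n - 1)^2 \lambda^2} = \frac{1}{n^2 \lambda^2}\left(1 + \frac{1}{n - 1}\right) = \frac{1}{n(n - 1)\lambda^2}$$

for $n > 1$. Finally,

$$\mathrm{cov}(\hat\lambda, \hat\mu) = \mathrm{cov}\left(\frac{n - 2}{U}, X_{(1)} - \frac{U}{n(n - 1)}\right) = -\frac{n - 2}{n(n - 1)} \mathrm{cov}\left(\frac{1}{U}, U\right) = \frac{1}{n(n - 1)}.$$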
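
As a small illustration of the two estimators derived above, the following sketch computes $\hat\mu = X_{(1)} - U/(n(n-1))$ and $\hat\lambda = (n-2)/U$ from a sample. The function name `umvu_shifted_exponential` is a hypothetical choice made for this example.

```python
def umvu_shifted_exponential(sample):
    """UMVU estimates (mu_hat, lambda_hat) for X_i ~ mu + G(lambda, 1).

    U is the sum of the distances of the observations from the sample minimum.
    Requires n >= 3 and at least two distinct values (so that U > 0).
    """
    n = len(sample)
    if n < 3:
        raise ValueError("lambda_hat is defined here only for n >= 3")
    x_min = min(sample)
    u = sum(x - x_min for x in sample)
    if u <= 0:
        raise ValueError("all observations are equal; U = 0")
    lambda_hat = (n - 2) / u
    mu_hat = x_min - u / (n * (n - 1))
    return mu_hat, lambda_hat


if __name__ == "__main__":
    # For the sample 1, 2, 3, 4: X_(1) = 1 and U = 6.
    print(umvu_shifted_exponential([1.0, 2.0, 3.0, 4.0]))
```

Note that $\hat\mu$ always lies below the sample minimum, which corrects the upward bias of $X_{(1)}$ as an estimator of $\mu$.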
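5.2.4
(i) $X_1, \ldots, X_n$ i.i.d. $N(\mu, 1)$. $\bar X_n \sim N\left(\mu, \frac{1}{n}\right)$. Thus,

$$E\{\Phi(\lambda(n) \bar X)\} = \Phi\left(\frac{\lambda(n)\mu}{\sqrt{1 + \lambda^2(n)/n}}\right).$$

Set $\frac{\lambda(n)}{\sqrt{1 + \lambda^2(n)/n}} = 1$; then $\lambda^2(n) = 1 + \frac{\lambda^2(n)}{n}$, so that $\lambda(n) = \frac{1}{\sqrt{1 - \frac{1}{n}}}$.
The UMVU estimator of $\Phi(\mu)$ is $\Phi\left(\frac{\bar X}{\sqrt{1 - \frac{1}{n}}}\right)$.

(ii) The variance of $\Phi\left(\frac{\bar X}{\sqrt{1 - \frac{1}{n}}}\right)$ is

$$E\left\{\Phi^2\left(\frac{\bar X}{\sqrt{1 - \frac{1}{n}}}\right)\right\} - \Phi^2(\mu).$$

If $Y \sim N(\eta, \tau^2)$ then

$$E\{\Phi^2(Y)\} = \Phi_2\left(\frac{\eta}{\sqrt{1 + \tau^2}}, \frac{\eta}{\sqrt{1 + \tau^2}}; \frac{\tau^2}{1 + \tau^2}\right).$$

In our case,

$$V\left\{\Phi\left(\frac{\bar X}{\sqrt{1 - \frac{1}{n}}}\right)\right\} = \Phi_2\left(\mu, \mu; \frac{1}{n}\right) - \Phi^2(\mu).$$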
i= 


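5.2.6 $X_1, \ldots, X_n \sim$ i.i.d. $NB(\psi, \nu)$, $0 < \psi < 1$, $\nu$ is known. The m.s.s. is $T =
\sum_{i=1}^{n} X_i \sim NB(\psi, n\nu)$.

$$E\left\{\frac{T}{n\nu + T - 1}\right\} = \sum_{t=1}^{\infty} \frac{t}{n\nu + t - 1} \cdot \frac{\Gamma(n\nu + t)}{t!\, \Gamma(n\nu)} \psi^t (1 - \psi)^{n\nu}$$
$$= \sum_{t=1}^{\infty} \frac{\Gamma(n\nu + t - 1)}{(t - 1)!\, \Gamma(n\nu)} \psi^t (1 - \psi)^{n\nu}$$
$$= \psi \sum_{t=0}^{\infty} \frac{\Gamma(n\nu + t)}{t!\, \Gamma(n\nu)} \psi^t (1 - \psi)^{n\nu} = \psi.$$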
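5.2.8 $(\bar X - \xi) \sim N\left(\mu - \xi, \frac{1}{n}\right)$. Hence, $(\bar X - \xi)^2 \sim \frac{1}{n} \chi^2[1; \lambda]$, where $\lambda =
\frac{n(\mu - \xi)^2}{2}$. Therefore,

$$E\{e^{t \chi^2[1; \lambda]/n}\} = e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} \left(1 - \frac{2t}{n}\right)^{-\frac{1}{2} - j} = \left(1 - \frac{2t}{n}\right)^{-1/2} \exp\left\{\frac{t(\mu - \xi)^2}{1 - 2t/n}\right\}.$$

Setting $t = -\frac{1}{2b(n)}$, we obtain

$$E\left\{\exp\left\{-\frac{(\bar X - \xi)^2}{2b(n)}\right\}\right\} = \left(1 + \frac{1}{n b(n)}\right)^{-1/2} \exp\left\{-\frac{(\mu - \xi)^2}{2\left(b(n) + \frac{1}{n}\right)}\right\}.$$

After dividing by $\sqrt{2\pi b(n)}$, this equals the $N(\mu, 1)$ density at $\xi$ if and only if $b(n) + \frac{1}{n} = 1$.
Thus, $b(n) + \frac{1}{n} = 1$ or $b(n) = 1 - \frac{1}{n} = \frac{n - 1}{n}$.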


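5.2.9 The probability of success is $p = e^{-\Delta/\theta}$. Accordingly, $\theta = -\Delta/\log(p)$.
Let $\bar p_n = \sum_{i=1}^{n} J_i / n$. Thus, $\hat p_n = \frac{n \bar p_n + 1/2}{n + 1}$. The estimator of $\theta$ is $\hat\theta_n =
-\Delta/\log(\hat p_n)$. The bias of $\hat\theta_n$ is $B(\hat\theta_n) = -\Delta\left(E\left\{\frac{1}{\log \hat p_n}\right\} - \frac{1}{\log(p)}\right)$. Let
$g(\hat p_n) = \frac{1}{\log(\hat p_n)}$. Taylor expansion around $p$ yields

$$E\{g(\hat p_n)\} = g(p) + g'(p) E\{\hat p_n - p\} + \frac{1}{2} g''(p) E\{(\hat p_n - p)^2\} + \frac{1}{6} g'''(p^*) E\{(\hat p_n - p)^3\},$$

where $|p^* - p| < |\hat p_n - p|$. Moreover,

$$g'(p) = -\frac{1}{p \log^2(p)},$$
$$g''(p) = \frac{2 + \log(p)}{p^2 \log^3(p)},$$
$$g'''(p) = -2\, \frac{3 \log(p) + 3 + \log^2(p)}{p^3 \log^4(p)}.$$

Furthermore,

$$E\{\hat p_n - p\} = -\frac{2p - 1}{2(n + 1)},$$
$$E\{(\hat p_n - p)^2\} = \frac{(n - 1) p (1 - p) + 1/4}{(n + 1)^2}.$$

Hence,

$$B(\hat\theta_n) = -\Delta\left(\frac{2p - 1}{2(n + 1) p \log^2(p)} + \frac{1}{2} \cdot \frac{(n - 1) p (1 - p) + 1/4}{(n + 1)^2} \cdot \frac{2 + \log(p)}{p^2 \log^3(p)}\right) + o\left(\frac{1}{n}\right), \quad \text{as } n \to \infty.$$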


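5.2.17 $X_1, \ldots, X_n \sim G(\lambda, \nu)$, $\nu > 3$ fixed.

(i) $\bar X \sim \frac{1}{n} G(\lambda, n\nu)$. Hence

$$E\left\{\frac{1}{\bar X^2}\right\} = \frac{n^2 \lambda^{n\nu}}{\Gamma(n\nu)} \int_0^\infty x^{n\nu - 3} e^{-\lambda x} dx = \frac{\lambda^2 n^2}{(n\nu - 1)(n\nu - 2)}.$$

The UMVU of $\lambda^2$ is $\hat\lambda^2 = \frac{(n\nu - 1)(n\nu - 2)}{n^2} \cdot \frac{1}{\bar X^2}$.

(ii) $V\{\hat\lambda^2\} = \lambda^4 \frac{2(2n\nu - 5)}{(n\nu - 3)(n\nu - 4)}$.

(iii) $I_n(\lambda) = \frac{n\nu}{\lambda^2}$. The Cramér-Rao lower bound for the variance of the
UMVU of $\lambda^2$ is

$$\frac{4\lambda^2 \cdot \lambda^2}{n\nu} = \frac{4\lambda^4}{n\nu}.$$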

: n 
(iv) 'Q) = = 3 


w(A) = 2, w/(A) = 2A, w"(A) = 2. 


(22, av) = AA, va): 
2: 1 
The second order BLB is


44 (nv +1) 
(nv) 


and third and fourth order BLB do not exist. 


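5.3.3 This is a continuation of Example 5.8. We have a sample of size $n$ of
vectors $(X, Y)$, where $\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left(0, \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)$. We have seen that
$V\{\hat\rho\} = \frac{1 - \rho^2}{n - 2}$. We derive now the variance of $\hat\sigma^2 = (Q_X + Q_Y)/2n$,
where $Q_X = \sum_{i=1}^{n} X_i^2$ and $Q_Y = \sum_{i=1}^{n} Y_i^2$. Note that

$$Q_Y \mid Q_X \sim \sigma^2 (1 - \rho^2) \chi^2\left[n; \frac{\rho^2 Q_X}{2\sigma^2 (1 - \rho^2)}\right].$$

Hence,

$$E(Q_Y \mid Q_X) = \sigma^2 n (1 - \rho^2) + \rho^2 Q_X$$

and

$$V(Q_Y \mid Q_X) = \sigma^4 (1 - \rho^2)^2 \left(2n + \frac{4\rho^2 Q_X}{\sigma^2 (1 - \rho^2)}\right).$$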

It follows that 


471 — 92 
(eer, 


Finally, since $(Q_X + Q_Y, P_{XY})$ is a complete sufficient statistic, and since $\hat\rho$
is invariant with respect to translations and change of scale, Basu's Theorem
implies that $\hat\sigma^2$ and $\hat\rho$ are independent. Hence, the variance-covariance
matrix of $(\hat\sigma^2, \hat\rho)$ is


204(1 — p”) 0 


V=- 1—p 


Thus, the efficiency of (67, 6), according to (5.3.13), is 


o4(1+ p*) — 0’)? — 04 p(1 — p’*? 
2041 = p2P/(1 = }) 


1 1 1 1 
(: = +o( ) asn —> OOo. 
2, n 2 n 


eff. = 
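5.4.3 $X_1, \ldots, X_5$ are i.i.d. having a Laplace distribution with p.d.f. $\frac{1}{\sigma} \phi\left(\frac{x - \mu}{\sigma}\right)$,
where $-\infty < \mu < \infty$, $0 < \sigma < \infty$, and $\phi(x) = \frac{1}{2} e^{-|x|}$, $-\infty < x < \infty$.
The standard c.d.f. ($\mu = 0$, $\sigma = 1$) is

$$F(x) = \begin{cases} \frac{1}{2} e^{x}, & -\infty < x < 0, \\ 1 - \frac{1}{2} e^{-x}, & 0 \le x < \infty. \end{cases}$$

Let $X_{(1)} < X_{(2)} < X_{(3)} < X_{(4)} < X_{(5)}$ be the order statistic. Note that $M_e =
X_{(3)}$. Also $X_{(i)} = \mu + \sigma U_{(i)}$, $i = 1, \ldots, 5$, where $U_{(i)}$ are the order statistic
from a standard distribution.

The densities of $U_{(1)}$, $U_{(3)}$, $U_{(5)}$ are

$$p_{(1)}(u) = \frac{5}{2} \exp(-|u|)(1 - F(u))^4, \quad -\infty < u < \infty,$$
$$p_{(3)}(u) = 15 \exp(-|u|)(F(u))^2 (1 - F(u))^2, \quad -\infty < u < \infty,$$
$$p_{(5)}(u) = \frac{5}{2} \exp(-|u|)(F(u))^4, \quad -\infty < u < \infty.$$

Since $\phi(u)$ is symmetric around $u = 0$, $F(-u) = 1 - F(u)$. Thus, $p_{(3)}(u)$ is
symmetric around $u = 0$, and $U_{(1)} \sim -U_{(5)}$. It follows that $\alpha_1 = E\{U_{(1)}\} =
-\alpha_5$ and $\alpha_3 = 0$. Moreover,

$$\alpha_1 = \frac{5}{2} \int_{-\infty}^{\infty} u \exp(-|u|)(1 - F(u))^4 du = -1.58854.$$

Accordingly, $\alpha' = (-1.58854, 0, 1.58854)$, $V\{U_{(1)}\} = V\{U_{(5)}\} =
1.470256$, $V\{U_{(3)}\} = 0.35118$.

$$\mathrm{cov}(U_{(1)}, U_{(3)}) = E\{U_{(1)} U_{(3)}\} = \frac{5!}{2! \cdot 4} \int_{-\infty}^{\infty} x \exp(-|x|) \int_x^{\infty} y \exp(-|y|)(F(y) - F(x))(1 - F(y))^2 \, dy \, dx = 0.264028,$$
$$\mathrm{cov}(U_{(1)}, U_{(5)}) = \mathrm{cov}(U_{(1)}, -U_{(1)}) = -V\{U_{(1)}\} = -1.470256,$$
$$\mathrm{cov}(U_{(3)}, U_{(5)}) = E\{U_{(3)}(-U_{(1)})\} = -0.264028.$$

Thus,

$$V = \begin{pmatrix} 1.47026 & 0.26403 & -1.47026 \\ 0.26403 & 0.35118 & -0.26403 \\ -1.47026 & -0.26403 & 1.47026 \end{pmatrix}.$$

Let

$$A = \begin{pmatrix} 1 & -1.58854 \\ 1 & 0 \\ 1 & 1.58854 \end{pmatrix}.$$

The matrix $V$ is singular, since $U_{(5)} = -U_{(1)}$. We take the generalized
inverse

$$V^{-} = \begin{pmatrix} 0.7863 & -0.5912 & 0 \\ -0.5912 & 3.2920 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

We then compute

$$(A' V^{-} A)^{-1} A' V^{-} = \begin{pmatrix} 0 & 1 & 0 \\ -0.6295 & 0.6295 & 0 \end{pmatrix}.$$

According to this, the estimator of $\mu$ is $\hat\mu = X_{(3)}$, and that of $\sigma$ is $\hat\sigma =
0.63 X_{(3)} - 0.63 X_{(1)}$. These estimators are not BLUE. Take the ordinary
LSE given by

$$\begin{pmatrix} \hat{\hat\mu} \\ \hat{\hat\sigma} \end{pmatrix} = (A'A)^{-1} A' \begin{pmatrix} X_{(1)} \\ X_{(3)} \\ X_{(5)} \end{pmatrix} = \begin{pmatrix} \frac{1}{3}(X_{(1)} + X_{(3)} + X_{(5)}) \\ -0.31475 X_{(1)} + 0.31475 X_{(5)} \end{pmatrix};$$

then the variance covariance matrix of these estimators is

$$\begin{pmatrix} 0.0391 & -0.0554 \\ -0.0554 & 0.5827 \end{pmatrix}.$$

Thus, $V\{\hat{\hat\mu}\} = 0.0391 < V\{\hat\mu\} = 0.3512$, and $V\{\hat{\hat\sigma}\} = 0.5827 > V\{\hat\sigma\} =
0.5125$. Due to the fact that $V$ is singular, $\hat{\hat\mu}$ is a better estimator than $\hat\mu$.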


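5.6.6 $(x_i, T_i)$, $i = 1, \ldots, n$. $T_i \sim (\alpha + \beta x_i) G(1, 1)$, $i = 1, \ldots, n$.

$$L(\alpha, \beta; x, T) = \prod_{i=1}^{n} \frac{1}{\alpha + \beta x_i} \exp\left\{-\frac{T_i}{\alpha + \beta x_i}\right\},$$

$$l(\alpha, \beta) = \log L(\alpha, \beta; x, T) = -\sum_{i=1}^{n} \left(\log(\alpha + \beta x_i) + \frac{T_i}{\alpha + \beta x_i}\right),$$

$$-\frac{\partial}{\partial \alpha} l(\alpha, \beta) = \sum_{i=1}^{n} \frac{1}{\alpha + \beta x_i} - \sum_{i=1}^{n} \frac{T_i}{(\alpha + \beta x_i)^2},$$

$$-\frac{\partial}{\partial \beta} l(\alpha, \beta) = \sum_{i=1}^{n} \frac{x_i}{\alpha + \beta x_i} - \sum_{i=1}^{n} \frac{x_i T_i}{(\alpha + \beta x_i)^2}.$$

The MLEs of $\alpha$ and $\beta$ are the roots of the equations:

$$\text{(I)} \quad \sum_{i=1}^{n} \frac{1}{\alpha + \beta x_i} = \sum_{i=1}^{n} \frac{T_i}{(\alpha + \beta x_i)^2},$$
$$\text{(II)} \quad \sum_{i=1}^{n} \frac{x_i}{\alpha + \beta x_i} = \sum_{i=1}^{n} \frac{x_i T_i}{(\alpha + \beta x_i)^2}.$$

The Newton-Raphson method for approximating numerically $(\alpha, \beta)$ is based on

$$G_1(\alpha, \beta) = \sum_{i=1}^{n} \frac{T_i - (\alpha + \beta x_i)}{(\alpha + \beta x_i)^2},$$
$$G_2(\alpha, \beta) = \sum_{i=1}^{n} \frac{T_i x_i - x_i(\alpha + \beta x_i)}{(\alpha + \beta x_i)^2}.$$

The matrix

$$D(\alpha, \beta) = \begin{pmatrix} \frac{\partial G_1}{\partial \alpha} & \frac{\partial G_1}{\partial \beta} \\ \frac{\partial G_2}{\partial \alpha} & \frac{\partial G_2}{\partial \beta} \end{pmatrix} = \begin{pmatrix} \sum w_i & \sum w_i x_i \\ \sum w_i x_i & \sum w_i x_i^2 \end{pmatrix},$$

where $w_i = \frac{\alpha + \beta x_i - 2T_i}{(\alpha + \beta x_i)^3}$.

$$|D(\alpha, \beta)| = \left(\sum w_i\right)\left(\sum w_i x_i^2\right) - \left(\sum w_i x_i\right)^2 = \left(\sum w_i\right)\left(\sum w_i \left(x_i - \frac{\sum w_j x_j}{\sum w_j}\right)^2\right).$$

We assume that $|D(\alpha, \beta)| > 0$ in each iteration. Starting with $(\alpha_1, \beta_1)$, we
get the following after the $l$th iteration:

$$\begin{pmatrix} \alpha_{l+1} \\ \beta_{l+1} \end{pmatrix} = \begin{pmatrix} \alpha_l \\ \beta_l \end{pmatrix} - (D(\alpha_l, \beta_l))^{-1} \begin{pmatrix} G_1(\alpha_l, \beta_l) \\ G_2(\alpha_l, \beta_l) \end{pmatrix}.$$

The LSE initial solution is

$$\hat\beta_1 = \frac{\sum T_i (x_i - \bar x)}{\sum (x_i - \bar x)^2}, \qquad \hat\alpha_1 = \bar T - \hat\beta_1 \bar x.$$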
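
The iteration described above can be sketched in a few lines of code. This is a minimal illustration under the stated model, with the score $(G_1, G_2)$, the weights $w_i$, and the LSE starting values taken from the solution; the function names, the tolerance, the iteration cap, and the small data set in the usage example are assumptions introduced for this sketch.

```python
def score(alpha, beta, xs, ts):
    """Return (G1, G2), the two likelihood-equation functions."""
    g1 = g2 = 0.0
    for x, t in zip(xs, ts):
        m = alpha + beta * x          # mean of T_i under the model
        r = (t - m) / m ** 2
        g1 += r
        g2 += x * r
    return g1, g2


def weight_sums(alpha, beta, xs, ts):
    """Return (sum w, sum w x, sum w x^2) with w_i = (m_i - 2 T_i) / m_i^3."""
    s0 = s1 = s2 = 0.0
    for x, t in zip(xs, ts):
        m = alpha + beta * x
        w = (m - 2.0 * t) / m ** 3
        s0 += w
        s1 += w * x
        s2 += w * x * x
    return s0, s1, s2


def lse_start(xs, ts):
    """Least-squares line through (x_i, T_i), used as the initial solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    t_bar = sum(ts) / n
    beta = sum(t * (x - x_bar) for x, t in zip(xs, ts)) / sum((x - x_bar) ** 2 for x in xs)
    return t_bar - beta * x_bar, beta


def newton_raphson(xs, ts, tol=1e-10, max_iter=50):
    """Iterate (alpha, beta) <- (alpha, beta) - D^{-1} G from the LSE start."""
    alpha, beta = lse_start(xs, ts)
    for _ in range(max_iter):
        g1, g2 = score(alpha, beta, xs, ts)
        s0, s1, s2 = weight_sums(alpha, beta, xs, ts)
        det = s0 * s2 - s1 * s1
        if det <= 0:
            raise ArithmeticError("|D| is not positive; the iteration is stopped")
        step_a = (s2 * g1 - s1 * g2) / det    # first component of D^{-1} G
        step_b = (-s1 * g1 + s0 * g2) / det   # second component of D^{-1} G
        alpha -= step_a
        beta -= step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return alpha, beta


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ts = [2.1, 2.9, 4.2, 4.8]   # hypothetical data, for illustration only
    print(newton_raphson(xs, ts))
```

The check on $|D|$ mirrors the assumption made in the solution; in practice one would also guard against $\alpha + \beta x_i \le 0$ during the iteration.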
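5.6.9 $X^2 = \sum_{i=1}^{k} \frac{(J_i - nP_i(\theta))^2}{nP_i(\theta)}$. For the Hardy-Weinberg model,

$$X^2(\theta) = \sum_{i=1}^{3} \frac{(J_i - nP_i(\theta))^2}{nP_i(\theta)} = \frac{(J_1 - n\theta^2)^2}{n\theta^2} + \frac{(J_2 - 2n\theta(1 - \theta))^2}{2n\theta(1 - \theta)} + \frac{(J_3 - n(1 - \theta)^2)^2}{n(1 - \theta)^2},$$

$$\frac{d}{d\theta} X^2(\theta) = \frac{N(\theta)}{2n\theta^3 (1 - \theta)^3},$$

where

$$N(\theta) = 2(2J_1^2 - J_2^2 + 2J_3^2)\theta^3 - 3(4J_1^2 - J_2^2)\theta^2 + (12J_1^2 - J_2^2)\theta - 4J_1^2.$$

Note that $J_3 = n - J_1 - J_2$. Thus, the MCE of $\theta$ is the root of $N(\theta) = 0$.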
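
The root of $N(\theta) = 0$ in $(0, 1)$ can be located numerically. The sketch below uses plain bisection, relying on $N(0) = -4J_1^2 < 0$ and $N(1) = 4J_3^2 > 0$ when $J_1, J_3 > 0$; the function names and the tolerance are assumptions made for this illustration, and bisection is just one convenient root finder, not a method prescribed by the text.

```python
def mce_cubic(theta, j1, j2, j3):
    """The cubic N(theta) whose root in (0, 1) is the MCE."""
    return (2 * (2 * j1 ** 2 - j2 ** 2 + 2 * j3 ** 2) * theta ** 3
            - 3 * (4 * j1 ** 2 - j2 ** 2) * theta ** 2
            + (12 * j1 ** 2 - j2 ** 2) * theta
            - 4 * j1 ** 2)


def chi_square(theta, j1, j2, j3):
    """X^2(theta) for the Hardy-Weinberg cell probabilities."""
    n = j1 + j2 + j3
    probs = (theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2)
    return sum((j - n * p) ** 2 / (n * p) for j, p in zip((j1, j2, j3), probs))


def minimum_chi_square_estimate(j1, j2, j3, tol=1e-12):
    """Bisection for the root of N(theta) on (0, 1); needs j1 > 0 and j3 > 0."""
    if j1 <= 0 or j3 <= 0:
        raise ValueError("bisection bracket requires J1 > 0 and J3 > 0")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mce_cubic(mid, j1, j2, j3) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    j1, j2, j3 = 40, 45, 15
    mce = minimum_chi_square_estimate(j1, j2, j3)
    mle = (2 * j1 + j2) / (2 * (j1 + j2 + j3))
    print(mce, mle, chi_square(mce, j1, j2, j3))
```

Comparing the MCE with the MLE $(2J_1 + J_2)/(2n)$ of Problem 8 on the same counts shows how close the two estimates typically are.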


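5.7.1 $X_1, \ldots, X_n$ are i.i.d. $R(0, \theta)$, $0 < \theta < \infty$.
(i) $cX \sim R(0, c\theta)$ for all $0 < c < \infty$. Thus, the model is preserved under
the scale transformation $G$.
(ii) The m.s.s. is $X_{(n)}$. Thus, an equivariant estimator of $\theta$ is

$$\hat\theta(X_{(n)}) = \psi X_{(n)}.$$

Consider the following invariant loss function:

$$L(\hat\theta, \theta) = \frac{(\hat\theta - \theta)^2}{\theta^2}.$$

There is only one orbit of $G$ in $\Theta$. Thus, find $\psi$ to minimize

$$Q(\psi) = E\{(\psi X_{(n)} - 1)^2\} = \psi^2 E(X_{(n)}^2) - 2\psi E(X_{(n)}) + 1,$$
$$Q'(\psi) = 2\psi E(X_{(n)}^2) - 2E(X_{(n)}).$$

Thus, $\psi^0 = \frac{E(X_{(n)})}{E(X_{(n)}^2)}$ computed at $\theta = 1$. Note that under $\theta = 1$, $X_{(n)} \sim
\mathrm{Beta}(n, 1)$, $E(X_{(n)}) = \frac{n}{n + 1}$, $E(X_{(n)}^2) = \frac{n}{n + 2}$, and $\psi^0 = \frac{n + 2}{n + 1}$.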
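
The minimization in (ii) can be verified with exact rational arithmetic. The sketch below evaluates $Q(\psi)$ at $\theta = 1$ for the optimal factor and, for comparison, for the factors $1$ (the MLE $X_{(n)}$) and $(n+1)/n$ (the unbiased estimator); the function names are illustrative assumptions.

```python
from fractions import Fraction


def beta_n1_moment(n, k):
    """E{X^k} for X ~ Beta(n, 1), i.e. X_(n) under theta = 1: n / (n + k)."""
    return Fraction(n, n + k)


def best_scale_factor(n):
    """psi0 = E{X_(n)} / E{X_(n)^2}, evaluated at theta = 1."""
    return beta_n1_moment(n, 1) / beta_n1_moment(n, 2)


def relative_mse(psi, n):
    """Q(psi) = E{(psi X_(n) - 1)^2} at theta = 1."""
    return psi ** 2 * beta_n1_moment(n, 2) - 2 * psi * beta_n1_moment(n, 1) + 1


if __name__ == "__main__":
    n = 5
    psi0 = best_scale_factor(n)
    print(psi0, relative_mse(psi0, n))
    print(relative_mse(Fraction(1), n), relative_mse(Fraction(n + 1, n), n))
```

For $n = 5$ the optimal factor is $7/6$, and its relative MSE is smaller than that of both the MLE and the unbiased estimator.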
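5.7.3

(i) The Pitman estimator of the location parameter in the normal case,
according to Equation (5.7.33), is

$$\hat\mu = X_{(1)} - \frac{\int_{-\infty}^{\infty} u \exp\left\{-\frac{1}{2}\left(u^2 + \sum_{i=2}^{n} (Y_{(i)} + u)^2\right)\right\} du}{\int_{-\infty}^{\infty} \exp\left\{-\frac{1}{2}\left(u^2 + \sum_{i=2}^{n} (Y_{(i)} + u)^2\right)\right\} du}.$$

Here

$$u^2 + \sum_{i=2}^{n} (Y_{(i)} + u)^2 = u^2 + \sum_{i=2}^{n} Y_{(i)}^2 + 2u \sum_{i=2}^{n} Y_{(i)} + (n - 1)u^2$$
$$= n\left(u^2 + \frac{2}{n} u \sum_{i=2}^{n} Y_{(i)} + \left(\frac{1}{n} \sum_{i=2}^{n} Y_{(i)}\right)^2\right) + \sum_{i=2}^{n} Y_{(i)}^2 - \frac{1}{n}\left(\sum_{i=2}^{n} Y_{(i)}\right)^2$$
$$= n\left(u + \frac{1}{n} \sum_{i=2}^{n} Y_{(i)}\right)^2 + \sum_{i=2}^{n} Y_{(i)}^2 - \frac{1}{n}\left(\sum_{i=2}^{n} Y_{(i)}\right)^2.$$

Moreover,

$$\frac{\sqrt{n}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\frac{n}{2}\left(u + \frac{1}{n} \sum_{i=2}^{n} Y_{(i)}\right)^2\right\} du = 1$$

and

$$\frac{\sqrt{n}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} u \exp\left\{-\frac{n}{2}\left(u + \frac{1}{n} \sum_{i=2}^{n} Y_{(i)}\right)^2\right\} du = -\frac{1}{n} \sum_{i=2}^{n} Y_{(i)}.$$

The other terms are cancelled. Thus,

$$\hat\mu = X_{(1)} + \frac{1}{n} \sum_{i=2}^{n} (X_{(i)} - X_{(1)}) = \frac{1}{n} X_{(1)} + \frac{1}{n} \sum_{i=2}^{n} X_{(i)} = \bar X_n.$$

This is the best equivariant estimator for the squared error loss.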
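(ii) For estimating $\sigma$ in the normal case, let $(\bar X, S)$ be the m.s.s., where
$S = \left(\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar X)^2\right)^{1/2}$. An equivariant estimator of $\sigma$, for the
translation-scale group, is

$$\hat\sigma = S \psi\left(\frac{X_1 - \bar X}{S}, \ldots, \frac{X_n - \bar X}{S}\right).$$

Let $u = \left(\frac{X_1 - \bar X}{S}, \ldots, \frac{X_n - \bar X}{S}\right)$. By Basu's Theorem, $S$ is independent
of $u$. We find $\psi(u)$ by minimizing $E\{(S\psi - 1)^2\}$ under $\sigma = 1$. $E_1\{S^2 \mid
u\} = E_1\{S^2\} = 1$ and $E_1\{S \mid u\} = E_1\{S\}$. Here, $\psi^0 = E_1\{S\}/E_1\{S^2\}$.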

CHAPTER 6 


Confidence and Tolerance Intervals 


PART I: THEORY 
6.1 GENERAL INTRODUCTION 


When $\theta$ is an unknown parameter and an estimator $\hat\theta$ is applied, the precision of
the estimator $\hat\theta$ can be stated in terms of its sampling distribution. With the aid of
the sampling distribution of an estimator we can determine the probability that the
estimator $\hat\theta$ lies within a prescribed interval around the true value of the parameter $\theta$.
Such a probability is called confidence (or coverage) probability. Conversely, for
a preassigned confidence level, we can determine an interval whose limits depend
on the observed sample values, and whose coverage probability is not smaller than
the prescribed confidence level, for all $\theta$. Such an interval is called a confidence
interval. In the simple example of estimating the parameters of a normal distribution
$N(\mu, \sigma^2)$, a minimal sufficient statistic for a sample of size $n$ is $(\bar X_n, S_n^2)$. We wish to
determine an interval $(\underline\mu(\bar X_n, S_n^2), \bar\mu(\bar X_n, S_n^2))$ such that

$$P_{\mu,\sigma}\{\underline\mu(\bar X_n, S_n^2) \le \mu \le \bar\mu(\bar X_n, S_n^2)\} = 1 - \alpha, \quad (6.1.1)$$

for all $\mu, \sigma$. The prescribed confidence level is $1 - \alpha$ and the confidence interval is
$(\underline\mu, \bar\mu)$. It is easy to prove that if we choose the functions

$$\underline\mu(\bar X_n, S_n^2) = \bar X_n - t_{1-\alpha/2}[n - 1] \frac{S_n}{\sqrt{n}},$$
$$\bar\mu(\bar X_n, S_n^2) = \bar X_n + t_{1-\alpha/2}[n - 1] \frac{S_n}{\sqrt{n}}, \quad (6.1.2)$$

then (6.1.1) is satisfied. The two limits of the confidence interval $(\underline\mu, \bar\mu)$ are called
the lower and upper confidence limits. Confidence limits for the variance $\sigma^2$ in
the normal case can be obtained from the sampling distribution of $S_n^2$. Indeed, since
$S_n^2 \sim \frac{\sigma^2}{n - 1} \chi^2[n - 1]$, the lower and upper confidence limits for $\sigma^2$ are given by

$$\underline\sigma^2 = \frac{(n - 1) S_n^2}{\chi^2_{1-\alpha/2}[n - 1]},$$
$$\bar\sigma^2 = \frac{(n - 1) S_n^2}{\chi^2_{\alpha/2}[n - 1]}. \quad (6.1.3)$$

A general method to derive confidence intervals in parametric cases is given in 
Section 6.2. The theory of optimal confidence intervals is developed in Section 6.3 
in parallel to the theory of optimal testing of hypotheses. The theory of tolerance 
intervals and regions is discussed in Section 6.4. Tolerance intervals are estimated 
intervals of a prescribed probability content according to the unknown parent distri- 
bution. One sided tolerance intervals are often applied in engineering designs and 
screening processes as illustrated in Example 6.1. 

Distribution free methods, based on the properties of order statistics, are developed 
in Section 6.5. These methods yield tolerance intervals for all distribution functions 
having some general properties (log-convex for example). Section 6.6 is devoted to 
the problem of determining simultaneous confidence intervals for several parameters. 
In Section 6.7, we discuss two-stage and sequential sampling to obtain fixed-width 
confidence intervals. 


6.2 THE CONSTRUCTION OF CONFIDENCE INTERVALS 
We discuss here a more systematic method of constructing confidence intervals. 

Let $\mathcal{F} = \{F(x; \theta), \theta \in \Theta\}$ be a parametric family of d.f.s. The parameter $\theta$ is real
or vector valued. Given the observed value of $X$, we construct a set $S(X)$ in $\Theta$ such
that

$$P_\theta\{\theta \in S(X)\} \ge 1 - \alpha, \quad \text{for all } \theta. \quad (6.2.1)$$

$S(X)$ is called a confidence region for $\theta$ at level of confidence $1 - \alpha$. Note that
the set $S(X)$ is a random set, since it is a function of $X$. For example, consider the
multinormal $N(\theta, I)$ case. We know that $(X - \theta)'(X - \theta)$ is distributed like $\chi^2[k]$,
where $k$ is the dimension of $X$. Thus, define

$$S(X) = \{\theta : (X - \theta)'(X - \theta) \le \chi^2_{1-\alpha}[k]\}. \quad (6.2.2)$$

It follows that, for all $\theta$,

$$P_\theta\{\theta \in S(X)\} = P\{(X - \theta)'(X - \theta) \le \chi^2_{1-\alpha}[k]\} = 1 - \alpha. \quad (6.2.3)$$
Accordingly, $S(X)$ is a confidence region. Note that if the problem, in this multinormal
case, is to test the simple hypothesis $H_0: \theta = \theta_0$ against the composite alternative
$H_1: \theta \neq \theta_0$ we would apply the test statistic

$$T(\theta_0) = (X - \theta_0)'(X - \theta_0), \quad (6.2.4)$$

and reject $H_0$ whenever $T(\theta_0) \ge \chi^2_{1-\alpha}[k]$. This test has size $\alpha$. If we define the
acceptance region for $H_0$ as the set

$$A(\theta_0) = \{X; (X - \theta_0)'(X - \theta_0) \le \chi^2_{1-\alpha}[k]\}, \quad (6.2.5)$$

then $H_0$ is accepted if $X \in A(\theta_0)$. The structures of $A(\theta_0)$ and $S(X)$ are similar. In
$A(\theta_0)$, we fix $\theta$ at $\theta_0$ and vary $X$, while in $S(X)$ we fix $X$ and vary $\theta$. Thus, let
$\mathcal{A} = \{A(\theta); \theta \in \Theta\}$ be a family of acceptance regions for the above testing problem,
when $\theta$ varies over all the points in $\Theta$. Such a family induces a family of confidence
sets $\mathcal{S} = \{S(X) : X \in \mathcal{X}\}$ according to the relation

$$S(X) = \{\theta : X \in A(\theta); A(\theta) \in \mathcal{A}\}. \quad (6.2.6)$$

In such a manner, we construct generally confidence regions (or intervals). We first
construct a family of acceptance regions, $\mathcal{A}$, for testing $H_0: \theta = \theta_0$ against $H_1:
\theta \neq \theta_0$ at level of significance $\alpha$. From this family, we construct the dual family
$\mathcal{S}$ of confidence regions. We remark here that in cases of a real parameter $\theta$ we
can consider one-sided hypotheses $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$; or $H_0: \theta \ge \theta_0$
against $H_1: \theta < \theta_0$. The corresponding families of acceptance regions will induce
families of one-sided confidence intervals $(-\infty, \bar\theta(X))$ or $(\underline\theta(X), \infty)$, respectively.
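
The duality (6.2.6) between acceptance regions and confidence regions can be illustrated for the multinormal example with $k = 2$. In that case the $\chi^2[2]$ c.d.f. is $1 - e^{-x/2}$, so the quantile $\chi^2_{1-\alpha}[2] = -2\log\alpha$ is available in closed form and no tables are needed. The sketch below is an illustration under that assumption; the function names, the seed, and the simulation size are choices made for this example.

```python
import math
import random


def chi2_quantile_2df(prob):
    """Quantile of chi-square[2], whose c.d.f. is 1 - exp(-x / 2)."""
    return -2.0 * math.log(1.0 - prob)


def in_acceptance_region(x, theta0, alpha):
    """X in A(theta0) of (6.2.5), for dimension k = 2."""
    distance_sq = sum((xi - ti) ** 2 for xi, ti in zip(x, theta0))
    return distance_sq <= chi2_quantile_2df(1.0 - alpha)


def in_confidence_region(theta, x, alpha):
    """theta in S(X) by relation (6.2.6): theta is covered iff X is in A(theta)."""
    return in_acceptance_region(x, theta, alpha)


def coverage(theta, alpha, reps, seed=0):
    """Monte Carlo estimate of P_theta{theta in S(X)} for X ~ N(theta, I)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        hits += in_confidence_region(theta, x, alpha)
    return hits / reps


if __name__ == "__main__":
    print(chi2_quantile_2df(0.95))
    print(coverage((1.0, -2.0), alpha=0.05, reps=20000))
```

The simulated coverage should be close to $1 - \alpha = 0.95$ for any choice of $\theta$, in line with (6.2.3).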


6.3 OPTIMAL CONFIDENCE INTERVALS 


In Example 6.2 we see two different families of lower confidence
intervals, one of which is obviously inefficient. We introduce now the theory of
uniformly most accurate (UMA) confidence intervals. According to this theory, the
family of lower confidence limits $\underline\theta_\alpha$ in that example is optimal.


Definition. A lower confidence limit for $\theta$, $\underline\theta(X)$, is called UMA if, given any other
lower confidence limit $\underline\theta^*(X)$,

$$P_\theta\{\underline\theta(X) \le \theta'\} \le P_\theta\{\underline\theta^*(X) \le \theta'\} \tag{6.3.1}$$

for all $\theta' < \theta$, and all $\theta$.

That is, although both $\underline\theta(X)$ and $\underline\theta^*(X)$ are smaller than $\theta$ with confidence
probability $(1 - \alpha)$, the probability is larger that the UMA limit $\underline\theta(X)$ is closer to
the true value $\theta$ than that of $\underline\theta^*(X)$. Whenever a size $\alpha$ uniformly most powerful
(UMP) test exists for testing the hypothesis $H_0 : \theta \le \theta_0$ against $H_1 : \theta > \theta_0$, then
a UMA $(1 - \alpha)$-lower confidence limit exists. Moreover, one can obtain the UMA
lower confidence limit from the UMP test function according to relationship (6.2.6).
The proof of this is very simple and left to the reader. Thus, as proven in Section 4.3,
if the family of d.f.s $\mathcal{F}$ is a one-parameter MLR family, the UMP test of size $\alpha$ of
$H_0 : \theta \le \theta_0$ against $H_1 : \theta > \theta_0$ is of the form


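$$\phi^0(T_n) = \begin{cases} 1, & \text{if } T_n > C_\alpha(\theta_0), \\ \gamma_\alpha, & \text{if } T_n = C_\alpha(\theta_0), \\ 0, & \text{otherwise,} \end{cases} \tag{6.3.2}$$

where $T_n$ is the minimal sufficient statistic. Accordingly, if $T_n$ is a continuous random
variable, the family of acceptance intervals is

$$\mathcal{A} = \{(-\infty, C_\alpha(\theta)), \ \theta \in \Theta\}. \tag{6.3.3}$$

The corresponding family of $(1 - \alpha)$-lower confidence limits is

$$\mathcal{S} = \{(\underline\theta_\alpha, \infty); \ T_n = C_\alpha(\underline\theta_\alpha), \ \theta \in \Theta\}. \tag{6.3.4}$$
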
In the discrete monotone likelihood ratio (MLR) case, we reduce the problem to
that of a continuous MLR by randomization, as specified in (6.3.2). Let $T_n$ be the
minimal sufficient statistic and, without loss of generality, assume that $T_n$ assumes
only the nonnegative integers. Let $H_n(t; \theta)$ be the cumulative distribution function
(c.d.f.) of $T_n$ under $\theta$. We have seen in Chapter 4 that the critical level of the test
(6.3.2) is

$$C_\alpha(\theta_0) = \text{least nonnegative integer } t \text{ such that } H_n(t; \theta_0) \ge 1 - \alpha. \tag{6.3.5}$$

Moreover, since the distributions are MLR, $C_\alpha(\theta)$ is a nondecreasing function of $\theta$.
In the continuous case, we determined the lower confidence limit $\underline\theta_\alpha$ as the root, $\theta$,
of the equation $T_n = C_\alpha(\theta)$. In the discrete case, we determine $\underline\theta_\alpha$ as the root, $\theta$, of
the equation

$$H_n(T_n - 1; \theta) + R[H_n(T_n; \theta) - H_n(T_n - 1; \theta)] = 1 - \alpha, \tag{6.3.6}$$

where $R$ is a random variable independent of $T_n$ and having a rectangular distribution
$R(0, 1)$. We can express Equation (6.3.6) in the form

$$R H_n(T_n; \underline\theta_\alpha) + (1 - R) H_n(T_n - 1; \underline\theta_\alpha) = 1 - \alpha. \tag{6.3.7}$$
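
Equation (6.3.7) can be solved numerically by bisection, since the left-hand side is nonincreasing in $\theta$ for an MLR family. The sketch below is illustrative only: it assumes the binomial case $T_n \sim B(n, \theta)$ (the setting of Example 6.3 B), takes the random number $R$ as an argument `r`, and returns 0 or 1 when no interior root exists. The function names are hypothetical.

```python
import math

def binom_cdf(t, n, theta):
    """H_n(t; theta) = P{T <= t} for T ~ B(n, theta); zero for t < 0."""
    if t < 0:
        return 0.0
    t = min(t, n)
    return sum(math.comb(n, j) * theta**j * (1 - theta)**(n - j) for j in range(t + 1))

def randomized_lower_limit(t_obs, n, alpha, r, tol=1e-10):
    """Root in theta of r*H(T; theta) + (1 - r)*H(T - 1; theta) = 1 - alpha."""
    def g(theta):
        return r * binom_cdf(t_obs, n, theta) + (1 - r) * binom_cdf(t_obs - 1, n, theta)

    if g(0.0) <= 1 - alpha:      # no root in (0, 1): the limit is the boundary 0
        return 0.0
    if g(1.0) >= 1 - alpha:      # g never drops to 1 - alpha: the limit is 1
        return 1.0
    lo, hi = 0.0, 1.0            # g is nonincreasing in theta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 1 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # With r = 0 and T = n the root is alpha**(1/n).
    print(randomized_lower_limit(10, 10, 0.05, 0.0), 0.05 ** 0.1)
```

Taking `r = 0` gives the conservative nonrandomized limit; larger values of `r` move the limit upward.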


If UMP tests do not exist we cannot construct UMA confidence limits. However,
we can define UMA-unbiased or UMA-invariant confidence limits and apply the
theory of testing hypotheses to construct such limits. Two-sided confidence intervals
$(\underline\theta_\alpha(X), \bar\theta_\alpha(X))$ should satisfy the requirement

$$P_\theta\{\underline\theta_\alpha(X) \le \theta \le \bar\theta_\alpha(X)\} \ge 1 - \alpha, \quad \text{for all } \theta. \tag{6.3.8}$$

A two-sided $(1 - \alpha)$ confidence interval $(\underline\theta_\alpha(X), \bar\theta_\alpha(X))$ is called UMA if, subject to
(6.3.8), it minimizes the coverage probabilities

$$P_\theta\{\underline\theta_\alpha(X) \le \theta_1 \le \bar\theta_\alpha(X)\}, \quad \text{for all } \theta_1 \ne \theta. \tag{6.3.9}$$

In order to obtain UMA two-sided confidence intervals, we should construct a UMP
test of size $\alpha$ of the hypothesis $H_0 : \theta = \theta_0$ against $H_1 : \theta \ne \theta_0$. Such a test generally
does not exist. However, we can construct a UMP-unbiased (UMPU) test of such
hypotheses (in cases of exponential families) and then derive the corresponding
confidence intervals.

A confidence interval of level $1 - \alpha$ is called unbiased if, subject to (6.3.8), it
satisfies

$$P_\theta\{\underline\theta_\alpha(X) \le \theta_1 \le \bar\theta_\alpha(X)\} \le 1 - \alpha, \quad \text{for all } \theta_1 \ne \theta. \tag{6.3.10}$$

Confidence intervals constructed on the basis of UMPU tests are UMAU (uniformly
most accurate unbiased) ones.


6.4 TOLERANCE INTERVALS 


Tolerance intervals can be described in general terms as estimated prediction intervals
for future realization(s) of the observed random variables. In Example 6.1,
we discuss such an estimation problem and illustrate a possible solution. Consider
a sequence $X_1, X_2, \ldots$ of independent and identically distributed (i.i.d.) random
variables having a common distribution $F(x; \theta)$, $\theta \in \Theta$. A $p$-content prediction
interval for a possible realization of $X$, when $\theta$ is known, is an interval $(l_p(\theta), u_p(\theta))$
such that $P_\theta[X \in (l_p(\theta), u_p(\theta))] \ge p$. Such two-sided prediction intervals are not
uniquely defined. Indeed, if $F^{-1}(p; \theta)$ is the $p$th quantile of $F(x; \theta)$ then for every
$0 \le \epsilon \le 1$, $l_p = F^{-1}(\epsilon(1 - p); \theta)$ and $u_p = F^{-1}(1 - (1 - \epsilon)(1 - p); \theta)$ are lower
and upper limits of a $p$-content prediction interval. Thus, $p$-content two-sided prediction
intervals should be defined more definitely, by imposing a further requirement
on the location of the interval. This is generally done according to the specific
problem under consideration. We will restrict attention here to one-sided prediction
intervals of the form $(-\infty, F^{-1}(p; \theta)]$ or $[F^{-1}(1 - p; \theta), \infty)$.
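
The non-uniqueness of two-sided $p$-content prediction intervals is easy to see numerically. The following sketch is an illustration under an assumption not made in the text: $F$ is taken to be a normal distribution with known parameters, represented by `statistics.NormalDist`. Every $\epsilon$ strictly between 0 and 1 yields a different interval with the same content $p$.

```python
from statistics import NormalDist

def prediction_interval(p, eps, dist):
    """Limits F^{-1}(eps(1-p)) and F^{-1}(1-(1-eps)(1-p)); requires 0 < eps < 1 here."""
    lower = dist.inv_cdf(eps * (1 - p))
    upper = dist.inv_cdf(1 - (1 - eps) * (1 - p))
    return lower, upper

def content(interval, dist):
    """Probability content F(upper) - F(lower) of the interval."""
    lower, upper = interval
    return dist.cdf(upper) - dist.cdf(lower)

if __name__ == "__main__":
    dist = NormalDist(mu=10.0, sigma=2.0)   # assumed known distribution
    for eps in (0.1, 0.5, 0.9):
        lo, hi = prediction_interval(0.9, eps, dist)
        print(eps, round(lo, 3), round(hi, 3), round(content((lo, hi), dist), 6))
```

The limiting choices $\epsilon = 0$ and $\epsilon = 1$ correspond to the one-sided intervals, whose infinite endpoints `inv_cdf` cannot return.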

When $\theta$ is unknown the limits of the prediction intervals are estimated. In this section,
we develop the theory of such parametric estimation. The estimated prediction
intervals are called tolerance intervals. Two types of tolerance intervals are discussed
in the literature: $p$-content tolerance intervals (see Guenther, 1971), which
are also called mean tolerance predictors (see Aitchison and Dunsmore, 1975);
and $(1 - \alpha)$ level $p$-content intervals, also called guaranteed coverage tolerance
intervals (Aitchison and Dunsmore, 1975; Guttman, 1970). $p$-Content one-sided tolerance
intervals, say $(-\infty, L_p(X_n))$, are determined on the basis of $n$ sample values
$X_n = (X_1, \ldots, X_n)$ so that, if $Y$ has the $F(x; \theta)$ distribution, then

$$P_\theta[Y \le L_p(X_n)] \ge p, \quad \text{for all } \theta. \tag{6.4.1}$$

Note that

$$P_\theta[Y \le L_p(X_n)] = E_\theta\{P_\theta[Y \le L_p(X_n) \mid X_n]\}. \tag{6.4.2}$$


Thus, given the value of $X_n$, the upper tolerance limit $L_p(X_n)$ is determined so that the
expected probability content of the interval $(-\infty, L_p(X_n)]$ will be $p$. The $(p, 1 - \alpha)$
guaranteed coverage one-sided tolerance intervals $(-\infty, L_{\alpha,p}(X_n))$ are determined so
that

$$P_\theta[F^{-1}(p; \theta) \le L_{\alpha,p}(X_n)] \ge 1 - \alpha, \tag{6.4.3}$$

for all $\theta$. In other words, $L_{\alpha,p}(X_n)$ is a $(1 - \alpha)$-upper confidence limit for the $p$th
quantile of the distribution $F(x; \theta)$. Or, with confidence level $(1 - \alpha)$, we can state that
the expected proportion of future observations not exceeding $L_{\alpha,p}(X_n)$ is at least $p$.
$(p, 1 - \alpha)$-upper tolerance limits can be obtained in cases of MLR parametric families
by substituting the $(1 - \alpha)$-upper confidence limit $\bar\theta_\alpha$ of $\theta$ in the formula of $F^{-1}(p; \theta)$.
Indeed, if $\mathcal{F} = \{F(x; \theta); \theta \in \Theta\}$ is a family depending on a real parameter $\theta$, and $\mathcal{F}$ is
MLR with respect to $X$, then the $p$th quantile, $F^{-1}(p; \theta)$, is an increasing function of
$\theta$, for each $0 < p < 1$. Thus, a one-sided $p$-content, $(1 - \alpha)$-level tolerance interval
is given by

$$L_{\alpha,p}(X_n) = F^{-1}(p; \bar\theta_\alpha(X_n)). \tag{6.4.4}$$

Moreover, if the upper confidence limit $\bar\theta_\alpha(X_n)$ is UMA then the corresponding
tolerance limit is a UMA upper confidence limit of $F^{-1}(p; \theta)$. For this reason such a
tolerance interval is called UMA. For more details, see Zacks (1971, p. 519).

In Example 6.1, we derive the $(\beta, 1 - \alpha)$ guaranteed lower tolerance limit for
the log-normal distribution. It is very simple in that case to determine the $\beta$-content
lower tolerance interval. Indeed, if $(\bar Y_n, S_n^2)$ are the sample mean and variance of the
corresponding normal variables $Y_i = \log X_i$ $(i = 1, \ldots, n)$ then

$$l_\beta(\bar Y_n, S_n) = \bar Y_n - t_\beta[n - 1] \, S_n \left(1 + \frac{1}{n}\right)^{1/2} \tag{6.4.5}$$

is such a $\beta$-content lower tolerance limit. Indeed, if a $N(\mu, \sigma^2)$ random variable $Y$ is
independent of $(\bar Y_n, S_n^2)$ then

$$P_{\mu,\sigma}\{Y \ge l_\beta(\bar Y_n, S_n)\} = P_{\mu,\sigma}\left\{ (Y - \bar Y_n) \Big/ \left( S_n \left(1 + \frac{1}{n}\right)^{1/2} \right) \ge -t_\beta[n - 1] \right\} = \beta, \tag{6.4.6}$$

since $Y - \bar Y_n \sim N\left(0, \sigma^2\left(1 + \frac{1}{n}\right)\right)$ and $S_n \sim \sigma \left( \frac{\chi^2[n-1]}{n-1} \right)^{1/2}$. It is interesting
to compare the $\beta$-content lower tolerance limit (6.4.5) with the $(1 - \alpha, \beta)$ guaranteed
coverage lower tolerance limit (6.4.6). We can show that if $\beta = 1 - \alpha$ then the two
limits are approximately the same in large samples.
n 


6.5 DISTRIBUTION FREE CONFIDENCE AND 
TOLERANCE INTERVALS 


Let $\mathcal{F}$ be the class of all absolutely continuous distributions. Suppose that $X_1, \ldots, X_n$
are i.i.d. random variables having a distribution $F(x)$ in $\mathcal{F}$. Let $X_{(1)} \le \cdots \le X_{(n)}$ be
the order statistics. This statistic is minimal sufficient. The transformed random
variable $Y = F(X)$ has a rectangular distribution on $(0, 1)$. Let $x_p$ be the $p$th quantile
of $F(x)$, i.e., $x_p = F^{-1}(p)$, $0 < p < 1$. We show now that the order statistics $X_{(i)}$
can be used as $(p, \gamma)$ tolerance limits, irrespective of the functional form of $F(x)$.
Indeed, the transformed random variables $Y_{(i)} = F(X_{(i)})$ have the beta distributions
$\beta(i, n - i + 1)$, $i = 1, \ldots, n$. Accordingly,

$$P[Y_{(i)} \ge p] = P[X_{(i)} \ge F^{-1}(p)] = I_{1-p}(n - i + 1, i), \quad i = 1, \ldots, n. \tag{6.5.1}$$

Therefore, a distribution free $(p, \gamma)$ upper tolerance limit is the smallest $X_{(j)}$ for which
the probability in (6.5.1) is at least $\gamma$. In other words, for any continuous distribution $F(x)$, define

$$i^0 = \text{least } j \ge 1 \text{ such that } I_{1-p}(n - j + 1, j) \ge \gamma. \tag{6.5.2}$$

Then, the order statistic $X_{(i^0)}$ is a $(p, \gamma)$-upper tolerance limit. We denote this by
$\bar L_{p,\gamma}(X)$. Similarly, a distribution free $(p, \gamma)$-lower tolerance limit is given by

$$\underline L_{p,\gamma}(X) = X_{(i^0)}, \text{ where } i^0 = \text{largest } j \ge 1 \text{ such that } I_{1-p}(j, n - j + 1) \ge \gamma. \tag{6.5.3}$$
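
The indices in (6.5.2) and (6.5.3) can be found without special functions, using the standard identity $I_{1-p}(n - k, k + 1) = P\{B(n, p) \le k\}$ between the incomplete beta function ratio and the binomial c.d.f. The sketch below relies on that identity; the function names are illustrative, and `None` is returned when no order statistic qualifies.

```python
import math

def binom_cdf(k, n, p):
    """P{B(n, p) <= k}, which equals I_{1-p}(n - k, k + 1)."""
    if k < 0:
        return 0.0
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(min(k, n) + 1))

def upper_tolerance_index(n, p, gamma):
    """Least j with I_{1-p}(n - j + 1, j) >= gamma, or None if n is too small."""
    for j in range(1, n + 1):
        if binom_cdf(j - 1, n, p) >= gamma:
            return j
    return None

def lower_tolerance_index(n, p, gamma):
    """Largest j with I_{1-p}(j, n - j + 1) >= gamma, or None if n is too small."""
    for j in range(n, 0, -1):
        # I_{1-p}(j, n - j + 1) = P{B(n, 1 - p) >= j}
        if 1.0 - binom_cdf(j - 1, n, 1 - p) >= gamma:
            return j
    return None

if __name__ == "__main__":
    print(upper_tolerance_index(5, 0.5, 0.9), lower_tolerance_index(5, 0.5, 0.9))
    print(upper_tolerance_index(5, 0.5, 0.99))  # None: the sample is too small
```

For $n = 5$, $p = 0.5$, $\gamma = 0.9$ the search selects $X_{(5)}$ as the upper limit and $X_{(1)}$ as the lower limit, while $\gamma = 0.99$ cannot be attained with five observations.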


The upper and lower tolerance intervals given in (6.5.2) and (6.5.3) might not exist if
$n$ is too small. They could be applied to obtain distribution free confidence intervals
for the mean, $\mu$, of a symmetric continuous distribution. The method is based on the
fact that the expected value, $\mu$, and the median, $F^{-1}(0.5)$, of continuous symmetric
distributions coincide. Since $I_{0.5}(a, b) = 1 - I_{0.5}(b, a)$ for all $0 < a, b < \infty$, we
obtain from (6.5.2) and (6.5.3) by substituting $p = 0.5$ that the $(1 - \alpha)$ upper and
lower distribution free confidence limits for $\mu$ are $\bar\mu_\alpha$ and $\underline\mu_\alpha$ where, for sufficiently
large $n$,

$$\bar\mu_\alpha = X_{(j)}, \text{ where } j = \text{least positive integer } k \text{ such that } I_{0.5}(n - k + 1, k + 1) \ge 1 - \alpha/2; \tag{6.5.4}$$

and

$$\underline\mu_\alpha = X_{(i)}, \text{ where } i = \text{largest positive integer } k \text{ such that } I_{0.5}(n - k + 1, k + 1) \le \alpha/2. \tag{6.5.5}$$


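Let $F$ be a log-convex distribution function. Then for any positive real numbers
$a_1, \ldots, a_r$,

$$-\log\left(1 - F\left(\sum_{i=1}^{r} a_i X_{(i)}\right)\right) \le -\sum_{i=1}^{r} a_i \log(1 - F(X_{(i)})), \tag{6.5.6}$$

or equivalently

$$F\left(\sum_{i=1}^{r} a_i X_{(i)}\right) \le 1 - \exp\left\{\sum_{i=1}^{r} a_i \log(1 - F(X_{(i)}))\right\}. \tag{6.5.7}$$

Let

$$G(X_{(i)}) = -\log(1 - F(X_{(i)})), \quad i = 1, \ldots, r. \tag{6.5.8}$$

Since $F(X) \sim R(0, 1)$ and $-\log(1 - R(0, 1)) \sim G(1, 1)$, the statistic $G(X_{(i)})$ is distributed
like the $i$th order statistic from a standard exponential distribution. Substitute
in (6.5.7)

$$\sum_{i=1}^{r} a_i X_{(i)} = \sum_{i=1}^{r} A_i (X_{(i)} - X_{(i-1)})$$

and

$$\sum_{i=1}^{r} a_i G(X_{(i)}) = \sum_{i=1}^{r} A_i (G(X_{(i)}) - G(X_{(i-1)})),$$

where $A_i = \sum_{j=i}^{r} a_j$, $i = 1, \ldots, r$, and $X_{(0)} \equiv 0$. Moreover,

$$G(X_{(i)}) - G(X_{(i-1)}) \sim \frac{1}{n - i + 1} G(1, 1), \quad i = 1, \ldots, r. \tag{6.5.9}$$

Hence, if we define

$$A_i = -\frac{2 \log(1 - p)}{\chi^2_{1-\alpha}[2r]} (n - i + 1), \quad i = 1, \ldots, r,$$

then, from (6.5.7),

$$\begin{aligned}
P\left\{F\left(\sum_{i=1}^{r} A_i (X_{(i)} - X_{(i-1)})\right) \le p\right\}
&\ge P\left\{1 - \exp\left\{-\sum_{i=1}^{r} A_i (G(X_{(i)}) - G(X_{(i-1)}))\right\} \le p\right\} \\
&= P\left\{\sum_{i=1}^{r} A_i (G(X_{(i)}) - G(X_{(i-1)})) \le -\log(1 - p)\right\} \\
&= P\{\chi^2[2r] \le \chi^2_{1-\alpha}[2r]\} = 1 - \alpha,
\end{aligned} \tag{6.5.10}$$

since $2 \sum_{i=1}^{r} (n - i + 1)(G(X_{(i)}) - G(X_{(i-1)})) \sim \chi^2[2r]$. This result was published first
by Barlow and Proschan (1966).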


6.6 SIMULTANEOUS CONFIDENCE INTERVALS 


It is often the case that we estimate simultaneously several parameters on the basis of
the same sample values. One could determine for each parameter a confidence interval
at level $(1 - \alpha)$ irrespective of the confidence intervals of the other parameters. The
result is that the overall confidence level is generally smaller than $(1 - \alpha)$. For
example, suppose that $(X_1, \ldots, X_n)$ is a sample of $n$ i.i.d. random variables from
$N(\mu, \sigma^2)$. The sample mean $\bar X$ and the sample variance $S^2$ are independent statistics.
Confidence intervals for $\mu$ and for $\sigma$, determined separately for each parameter, are

$$I_1(\bar X, S) = \left(\bar X - t_{1-\alpha/2}[n - 1] \frac{S}{\sqrt{n}}, \ \bar X + t_{1-\alpha/2}[n - 1] \frac{S}{\sqrt{n}}\right)$$

and

$$I_2(S) = \left(S \left(\frac{n - 1}{\chi^2_{1-\alpha/2}[n - 1]}\right)^{1/2}, \ S \left(\frac{n - 1}{\chi^2_{\alpha/2}[n - 1]}\right)^{1/2}\right),$$

respectively. These intervals are not independent. We can state that the probability
for $\mu$ to be in $I_1(\bar X, S)$ is $(1 - \alpha)$ and that of $\sigma$ to be in $I_2(S)$ is $(1 - \alpha)$. But what
is the probability that both statements are simultaneously true? According to the
Bonferroni inequality (4.6.50),

$$P_{\mu,\sigma}\{\mu \in I_1(\bar X, S), \ \sigma \in I_2(S)\} \ge 1 - P_{\mu,\sigma}\{\mu \notin I_1(\bar X, S)\} - P_{\mu,\sigma}\{\sigma \notin I_2(S)\} = 1 - 2\alpha, \quad \text{for all } \mu, \sigma. \tag{6.6.1}$$


We see that a lower bound to the simultaneous coverage probability of $(\mu, \sigma)$
is, according to (6.6.1), $1 - 2\alpha$. The actual simultaneous coverage probability of
$I_1(\bar X, S)$ and $I_2(S)$ can be determined by evaluating the integral

$$P(\sigma) = 2 \int_{\chi^2_{\alpha/2}[n-1]}^{\chi^2_{1-\alpha/2}[n-1]} \Phi\left(t_{1-\alpha/2}[n - 1] \left(\frac{x}{n - 1}\right)^{1/2}\right) g_n(x) \, dx - (1 - \alpha), \tag{6.6.2}$$

where $g_n(x)$ is the probability density function (p.d.f.) of $\chi^2[n - 1]$ and $\Phi(\cdot)$ is the
standard normal integral. The value of $P(\sigma)$ is smaller than $(1 - \alpha)$. In order to make it
at least $(1 - \alpha)$, we can modify the individual confidence probabilities of $I_1(\bar X, S)$ and
of $I_2(S)$ to be $1 - \alpha/2$. Then the simultaneous coverage probability will be between
$(1 - \alpha)$ and $(1 - \alpha/2)$. This is a simple procedure that is somewhat conservative.
It guarantees a simultaneous confidence level not smaller than the nominal $(1 - \alpha)$.
This method of constructing simultaneous confidence intervals, called the Bonferroni
method, has many applications. We have shown in Chapter 4 an application of this
method in a two-way analysis of variance problem. Miller (1966, p. 67) discussed
an application of the Bonferroni method in a case of simultaneous estimation of $k$
normal means.
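
The gap between the Bonferroni bound and the actual simultaneous coverage can be examined by simulation. The sketch below is only an illustration under a convenient assumption: the sample size is fixed at $n = 3$, so that the $t[2]$ and $\chi^2[2]$ quantiles have closed forms and no statistical library is needed. The function names are hypothetical.

```python
import math
import random
import statistics

def t_quantile_2df(prob):
    """Quantile of Student's t with 2 d.f.; c.d.f. is 1/2 + t / (2*sqrt(2 + t^2))."""
    u = 2.0 * prob - 1.0
    return u * math.sqrt(2.0 / (1.0 - u * u))

def chi2_quantile_2df(prob):
    """Quantile of chi-square with 2 d.f.; c.d.f. is 1 - exp(-x/2)."""
    return -2.0 * math.log(1.0 - prob)

def joint_coverage(alpha, mu=0.0, sigma=1.0, reps=20000, seed=7):
    """Monte Carlo estimate of P{mu in I1 and sigma in I2} for samples of size n = 3."""
    n = 3
    t = t_quantile_2df(1 - alpha / 2)
    c_lo = chi2_quantile_2df(alpha / 2)
    c_hi = chi2_quantile_2df(1 - alpha / 2)
    rng = random.Random(seed)
    both = 0
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(x)
        s = statistics.stdev(x)
        in_i1 = abs(xbar - mu) <= t * s / math.sqrt(n)
        in_i2 = s * math.sqrt((n - 1) / c_hi) <= sigma <= s * math.sqrt((n - 1) / c_lo)
        both += in_i1 and in_i2
    return both / reps

if __name__ == "__main__":
    print(joint_coverage(0.05))    # individual levels 0.95 each
    print(joint_coverage(0.025))   # Bonferroni-adjusted individual levels 0.975 each
```

With individual levels $1 - \alpha$ the estimate should fall between $1 - 2\alpha$ and $1 - \alpha$; with individual levels $1 - \alpha/2$ it should be at least $1 - \alpha$, up to simulation error.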

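Consider again the linear model of full rank discussed in Section 5.3.2, in which the
vector $X$ has a multinormal distribution $N(A\beta, \sigma^2 I)$. $A$ is an $n \times p$ matrix of full rank
and $\beta$ is a $p \times 1$ vector of unknown parameters. The least-squares estimator (LSE)
of a specific linear combination of $\beta$, say $\lambda = \boldsymbol{\alpha}'\beta$, is $\hat\lambda = \boldsymbol{\alpha}'\hat\beta = \boldsymbol{\alpha}'(A'A)^{-1}A'X$. We
proved that $\hat\lambda \sim N(\boldsymbol{\alpha}'\beta, \sigma^2 \boldsymbol{\alpha}'(A'A)^{-1}\boldsymbol{\alpha})$. Moreover, an unbiased estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n - p} X'(I - A(A'A)^{-1}A')X,$$

where $\hat\sigma^2 \sim \frac{\sigma^2}{n - p} \chi^2[n - p]$. Hence, a $(1 - \alpha)$ confidence interval for the particular
parameter $\lambda$ is

$$\hat\lambda \pm t_{1-\alpha/2}[n - p] \, \hat\sigma \, (\boldsymbol{\alpha}'(A'A)^{-1}\boldsymbol{\alpha})^{1/2}. \tag{6.6.3}$$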


Suppose that we are interested in the simultaneous estimation of all (many) linear
combinations belonging to a certain $r$-dimensional linear subspace, $1 \le r \le p$. For
example, if we are interested in contrasts of the $\beta$-components, then $\lambda = \sum_{i=1}^{p} \alpha_i \beta_i$,
where $\sum_i \alpha_i = 0$. In this case, the linear subspace of all such contrasts is of dimension
$r = p - 1$. Let $L$ be an $r \times p$ matrix with $r$ row vectors that constitute a basis for
the linear subspace under consideration. For example, in the case of all contrasts, the
matrix $L$ can be taken as the $(p - 1) \times p$ matrix:


Every vector $\boldsymbol{\alpha}$ belonging to the specified subspace is given by some linear combination
$\boldsymbol{\alpha}' = \gamma' L$. Thus, $\boldsymbol{\alpha}'(A'A)^{-1}\boldsymbol{\alpha} = \gamma' L (A'A)^{-1} L' \gamma$. Moreover,

$$L\hat\beta \sim N(L\beta, \sigma^2 L(A'A)^{-1}L') \tag{6.6.4}$$

and

$$(\hat\beta - \beta)' L' (L(A'A)^{-1}L')^{-1} L (\hat\beta - \beta) \sim \sigma^2 \chi^2[r], \tag{6.6.5}$$

where $r$ is the rank of $L$. Accordingly,

$$\frac{1}{\hat\sigma^2} (\hat\beta - \beta)' L' (L(A'A)^{-1}L')^{-1} L (\hat\beta - \beta) \sim r F[r, n - p] \tag{6.6.6}$$

and the probability is $(1 - \alpha)$ that $\beta$ belongs to the ellipsoid

$$E_\alpha(\hat\beta, \hat\sigma^2, L) = \{\beta : (\beta - \hat\beta)' L' (L(A'A)^{-1}L')^{-1} L (\beta - \hat\beta) \le r \hat\sigma^2 F_{1-\alpha}[r, n - p]\}. \tag{6.6.7}$$

$E_\alpha(\hat\beta, \hat\sigma^2, L)$ is a simultaneous confidence region for all $\boldsymbol{\alpha}'\beta$ at level $(1 - \alpha)$. Consider
any linear combination $\lambda = \boldsymbol{\alpha}'\beta = \gamma' L \beta$. The simultaneous confidence interval for
$\lambda$ can be obtained by the orthogonal projection of the ellipsoid $E_\alpha(\hat\beta, \hat\sigma^2, L)$ on the
line $l$ spanned by the vector $\gamma$. We obtain the following formula for the confidence
limits of this interval:

$$\hat\lambda \pm (r F_{1-\alpha}[r, n - p])^{1/2} \, \hat\sigma \, (\gamma' L (A'A)^{-1} L' \gamma)^{1/2}, \tag{6.6.8}$$

where $\hat\lambda = \gamma' L \hat\beta = \boldsymbol{\alpha}'\hat\beta$. We see that in case of $r = 1$ formula (6.6.8) reduces to
(6.6.3); otherwise the coefficient $(r F_{1-\alpha}[r, n - p])^{1/2}$ is greater than $t_{1-\alpha/2}[n - p]$.
This coefficient is called Scheffé's S-coefficient. Various applications and modifications
of the S-method have been proposed in the literature. For applications often used
in statistical practice, see Miller (1966, p. 54). Scheffé (1970) suggested some modifications
for increasing the efficiency of the S-method for simultaneous confidence
intervals.
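
The reduction of (6.6.8) to (6.6.3) when $r = 1$ rests on a standard distributional identity that the text does not spell out: if $T \sim t[\nu]$ then $T^2 \sim F[1, \nu]$. A short worked step, with $\nu = n - p$, is sketched below.

```latex
% Assumes the standard identity T^2 ~ F[1, \nu] for T ~ t[\nu]; here \nu = n - p.
\[
  1 - \alpha
  = P\{ |T| \le t_{1-\alpha/2}[\nu] \}
  = P\{ T^2 \le t^2_{1-\alpha/2}[\nu] \}
  \quad\Longrightarrow\quad
  F_{1-\alpha}[1, \nu] = t^2_{1-\alpha/2}[\nu],
\]
\[
  \bigl( r \, F_{1-\alpha}[r, n-p] \bigr)^{1/2} \Big|_{r=1} = t_{1-\alpha/2}[n-p].
\]
```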


6.7 TWO-STAGE AND SEQUENTIAL SAMPLING FOR FIXED WIDTH 
CONFIDENCE INTERVALS 


We start the discussion with the problem of determining fixed-width confidence
intervals for the mean $\mu$ of a normal distribution when the variance $\sigma^2$ is unknown and
can be arbitrarily large. We saw previously that if the sample consists of $n$ i.i.d. random
variables $X_1, \ldots, X_n$, where $n$ is fixed before the sampling, then UMAU confidence
limits for $\mu$ are given, in correspondence to the t-test, by $\bar X \pm t_{1-\alpha/2}[n - 1] \frac{S}{\sqrt{n}}$, where
$\bar X$ and $S$ are the sample mean and standard deviation, respectively. The width of this
confidence interval is

$$\Delta^* = 2 t_{1-\alpha/2}[n - 1] \frac{S}{\sqrt{n}}. \tag{6.7.1}$$

Although the width of the interval converges to zero as $n \to \infty$, for each fixed
$n$ it can be arbitrarily large with positive probability. The question is whether there
exists another confidence interval with bounded width. We show now that there is no
fixed-width confidence interval in the present normal case if the sample is of fixed
size. Let $I_\delta(\bar X, S)$ be any fixed width interval centered at $\hat\mu(\bar X, S)$, i.e.,

$$I_\delta(\bar X, S) = (\hat\mu(\bar X, S) - \delta, \ \hat\mu(\bar X, S) + \delta). \tag{6.7.2}$$

We show that the maximal possible confidence level is

$$\sup_{\hat\mu} \inf_{\mu,\sigma} P_{\mu,\sigma}\{\mu \in I_\delta(\bar X, S)\} = 0. \tag{6.7.3}$$

This means that there is no statistic $\hat\mu(\bar X, S)$ for which $I_\delta(\bar X, S)$ is a confidence
interval. Indeed,

$$\sup_{\hat\mu} \inf_{\mu,\sigma} P_{\mu,\sigma}\{\mu \in I_\delta(\bar X, S)\} \le \lim_{\sigma \to \infty} \inf_{\mu} \sup_{\hat\mu} P_{\mu,\sigma}\{\mu \in I_\delta(\bar X, S)\}. \tag{6.7.4}$$

In Example 9.2, we show that $\hat\mu(\bar X, S) = \bar X$ is a minimax estimator, which maximizes
the minimum coverage. Accordingly,

$$\inf_{\mu} \sup_{\hat\mu} P_{\mu,\sigma}\{\mu \in I_\delta(\bar X, S)\} = P_\sigma\{\bar X - \delta \le \mu \le \bar X + \delta\} = 2\Phi\left(\frac{\delta}{\sigma}\sqrt{n}\right) - 1. \tag{6.7.5}$$

Substituting this result in (6.7.4), we readily obtain (6.7.3), by letting $\sigma \to \infty$.
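
The limit behind (6.7.3) is easy to tabulate. The short sketch below evaluates the coverage $2\Phi(\delta\sqrt{n}/\sigma) - 1$ of the interval $(\bar X - \delta, \bar X + \delta)$ from (6.7.5) for a fixed $n$ and increasing $\sigma$; the particular values of $\delta$, $n$, and $\sigma$ are arbitrary choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def fixed_width_coverage(delta, n, sigma):
    """Coverage 2*Phi(delta*sqrt(n)/sigma) - 1 of (Xbar - delta, Xbar + delta)."""
    return 2.0 * NormalDist().cdf(delta * sqrt(n) / sigma) - 1.0

if __name__ == "__main__":
    # For fixed n the coverage decreases toward zero as sigma grows.
    for sigma in (1.0, 10.0, 100.0, 1000.0):
        print(sigma, round(fixed_width_coverage(1.0, 25, sigma), 4))
```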
Stein's two-stage procedure. Stein (1945) provided a two-stage solution to this
problem of determining a fixed-width confidence interval for the mean $\mu$. According
to Stein's procedure the sampling is performed in two stages:


Stage I:

(i) Observe a sample of $n_1$ i.i.d. random variables from $N(\mu, \sigma^2)$.
(ii) Compute the sample mean $\bar X_{n_1}$ and standard deviation $S_{n_1}$.
(iii) Determine

$$N = 1 + \left[ t^2_{1-\alpha/2}[n_1 - 1] \frac{S^2_{n_1}}{\delta^2} \right], \tag{6.7.6}$$

where $[x]$ designates the integer part of $x$.
(iv) If $N > n_1$, go to Stage II; else set the interval
$I_\delta(\bar X_{n_1}) = (\bar X_{n_1} - \delta, \bar X_{n_1} + \delta)$.

Stage II:

(i) Observe $N_2 = N - n_1$ additional i.i.d. random variables from $N(\mu, \sigma^2)$;
$Y_1, \ldots, Y_{N_2}$.
(ii) Compute the overall mean $\bar X_N = (n_1 \bar X_{n_1} + N_2 \bar Y_{N_2})/N$.
(iii) Determine the interval $I_\delta(\bar X_N) = (\bar X_N - \delta, \bar X_N + \delta)$.
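
The two stages above can be sketched in a few lines of code. This is an illustrative simulation, not the author's implementation: the quantile $t_{1-\alpha/2}[n_1 - 1]$ is supplied by the caller (for example from a table), the total sample size is taken as the larger of $n_1$ and $1 + [t^2 S^2_{n_1}/\delta^2]$, and the function names are hypothetical.

```python
import math
import random
import statistics

def stein_two_stage(draw, n1, delta, t_quantile):
    """One run of the two-stage procedure; returns (N, overall mean).

    draw() returns one observation; t_quantile is t_{1-alpha/2}[n1 - 1],
    supplied by the caller.
    """
    first = [draw() for _ in range(n1)]
    s2 = statistics.variance(first)                      # first-stage sample variance
    n_total = max(n1, math.floor(t_quantile**2 * s2 / delta**2) + 1)
    second = [draw() for _ in range(n_total - n1)]       # empty if no second stage
    return n_total, statistics.fmean(first + second)

def simulated_coverage(mu, sigma, n1, delta, t_quantile, reps=4000, seed=3):
    """Monte Carlo estimate of P{|Xbar_N - mu| < delta}."""
    rng = random.Random(seed)
    draw = lambda: rng.gauss(mu, sigma)
    hits = 0
    for _ in range(reps):
        _, xbar = stein_two_stage(draw, n1, delta, t_quantile)
        hits += abs(xbar - mu) < delta
    return hits / reps

if __name__ == "__main__":
    # n1 = 11 and the tabulated value t_{0.975}[10] = 2.228 (alpha = 0.05).
    print(simulated_coverage(mu=0.0, sigma=2.0, n1=11, delta=0.5, t_quantile=2.228))
```

The estimated coverage should be at least about $1 - \alpha$ whatever value of $\sigma$ is used, at the cost of a total sample size that grows with $S^2_{n_1}/\delta^2$.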


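The size of the second stage sample $N_2 = (N - n_1)^+$ is a random variable, which
is a function of the first stage sample variance $S^2_{n_1}$. Since $\bar X_{n_1}$ and $S^2_{n_1}$ are independent,
$\bar X_{n_1}$ and $N_2$ are independent. Moreover, $\bar Y_{N_2}$ is conditionally independent of $S^2_{n_1}$, given
$N_2$. Hence,

$$\begin{aligned}
P_{\mu,\sigma}\{|\bar X_N - \mu| < \delta\} &= E\{P_{\mu,\sigma}\{|\bar X_N - \mu| < \delta \mid N\}\} \\
&= E\left\{2\Phi\left(\frac{\delta}{\sigma}\sqrt{N}\right) - 1\right\} \\
&\ge 2E\left\{\Phi\left(\frac{S_{n_1}}{\sigma} t_{1-\alpha/2}[n_1 - 1]\right)\right\} - 1 \\
&= 2P\left\{\frac{N(0, 1)}{(\chi^2[n_1 - 1]/(n_1 - 1))^{1/2}} \le t_{1-\alpha/2}[n_1 - 1]\right\} - 1 = 1 - \alpha.
\end{aligned} \tag{6.7.7}$$

This proves that the fixed width interval $I_\delta(\bar X_N)$ based on the prescribed two-stage
sampling procedure is a confidence interval. The Stein two-stage procedure is not
an efficient one, unless one has good knowledge of how large $n_1$ should be. If $\sigma^2$ is
known there exists a UMAU confidence interval of fixed size, i.e., $I_\delta(\bar X_{n^0(\delta)})$ where

$$n^0(\delta) = 1 + \left[\chi^2_{1-\alpha}[1] \frac{\sigma^2}{\delta^2}\right]. \tag{6.7.8}$$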


If $n_1$ is close to $n^0(\delta)$ the procedure is expected to be efficient. $n^0(\delta)$ is, however,
unknown. Various approaches have been suggested to obtain efficient procedures of
sampling. We discuss here a sequential procedure that is asymptotically efficient.
Note that the optimal sample size $n^0(\delta)$ increases to infinity like $1/\delta^2$ as $\delta \to 0$.
Accordingly, a sampling procedure, with possibly random sample size, $N$, which
yields a fixed-width confidence interval $I_\delta(\bar X_N)$ is called asymptotically efficient if

$$\lim_{\delta \to 0} \frac{E\{N\}}{n^0(\delta)} = 1. \tag{6.7.9}$$

Sequential fixed-width interval estimation. Let $\{a_n\}$ be a sequence of positive numbers
such that $a_n \to \chi^2_{1-\alpha}[1]$ as $n \to \infty$. We can set, for example, $a_n = F_{1-\alpha}[1, n]$ for
all $n \ge n_1$ and $a_n = \infty$ for $n < n_1$. Consider now the following sequential procedure:

1. Starting with $n = n_1$ i.i.d. observations compute $\bar X_n$ and $S_n^2$.
2. If $n \ge a_n S_n^2/\delta^2$ stop sampling and estimate $\mu$ by $I_\delta(\bar X_n)$; else take an additional
independent observation and return to step 1.

Let

$$N(\delta) = \text{least } n \ge n_1 \text{ such that } n \ge a_n S_n^2/\delta^2. \tag{6.7.10}$$

According to the specified procedure, the sample size at termination is $N(\delta)$. $N(\delta)$
is called a stopping variable. We have to show first that $N(\delta)$ is finite with probability
one, i.e.,


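$$\lim_{n \to \infty} P_{\mu,\sigma}\{N(\delta) > n\} = 0, \tag{6.7.11}$$

for each $\delta > 0$. Indeed, for any given $n$,

$$P_{\mu,\sigma}\{N(\delta) > n\} = P_{\mu,\sigma}\left\{\bigcap_{j=n_1}^{n} \left\{S_j^2 > \frac{j\delta^2}{a_j}\right\}\right\} \le P_{\mu,\sigma}\left\{S_n^2 > \frac{n\delta^2}{a_n}\right\}. \tag{6.7.12}$$

But

$$P\left\{S_n^2 > \frac{n\delta^2}{a_n}\right\} = P\left\{\frac{S_n^2}{n} > \frac{\delta^2}{a_n}\right\}. \tag{6.7.13}$$

$S_n^2/n \to 0$ a.s. as $n \to \infty$, therefore

$$\lim_{n \to \infty} P\left\{\frac{S_n^2}{n} > \frac{\delta^2}{a_n}\right\} = \lim_{n \to \infty} P\left\{\frac{\chi^2[n - 1]}{n(n - 1)} > \frac{\delta^2}{\sigma^2 \chi^2_{1-\alpha}[1]}\right\} = 0 \tag{6.7.14}$$

as $n \to \infty$. Thus, (6.7.11) is satisfied and $N(\delta)$ is a finite random variable. The present
sequential procedure attains in large samples the required confidence level and is also
an efficient one. One can prove in addition the following optimal properties: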


(i) If $a_n = a$ for all $n \ge n_1$ then $E_\sigma\{N(\delta)\} \le n^0(\delta) + n_1 + 1$, for all $\sigma^2$. (6.7.15)


This obviously implies the asymptotic efficiency (6.7.9). It is, however, a much
stronger property. One does not have to pay, on the average, more than the equivalent
of $n_1 + 1$ observations. The question is whether we do not tend to stop too soon and
thus lose confidence probability. Simons (1968) proved that if we follow the above
procedure, $n_1 \ge 3$ and $a_n = a$ for all $n \ge 3$, then there exists a finite integer $k$ such
that

$$P_{\mu,\sigma}\{|\bar X_{N+k} - \mu| \le \delta\} \ge 1 - \alpha, \tag{6.7.16}$$

for all $\mu$, $\sigma$ and $\delta$. This means that the possible loss of confidence probability is not
more than the one associated with a finite number of observations. In other words, if
the sample is large we generally attain the required confidence level.

We have not provided here proofs of these interesting results. The reader is referred
to Zacks (1971, p. 560). The results were also extended to general classes of distributions
originally by Chow and Robbins (1965), followed by studies of Starr
(1966), Khan (1969), Srivastava (1971), Ghosh, Mukhopadhyay, and Sen (1997),
and Mukhopadhyay and de Silva (2009).
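
The sequential stopping rule (6.7.10) can be simulated directly. The sketch below is illustrative and makes assumptions beyond the text: it uses the constant sequence $a_n = a = \chi^2_{1-\alpha}[1]$, computed as the square of the standard normal quantile $z_{1-\alpha/2}$, requires $n_1 \ge 2$, and updates the sample variance with Welford's recursion. The function names are hypothetical.

```python
import random
from statistics import NormalDist

def sequential_sample_size(draw, n1, delta, a):
    """Least n >= n1 with n >= a * S_n^2 / delta^2; returns (n, sample mean). Needs n1 >= 2."""
    n, mean, m2 = 0, 0.0, 0.0
    while True:
        x = draw()
        n += 1
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)              # Welford update of the sum of squared deviations
        if n >= n1 and n >= a * (m2 / (n - 1)) / delta**2:
            return n, mean

def average_sample_size(sigma, n1, delta, alpha, reps=2000, seed=11):
    """Returns (simulated E{N(delta)}, a * sigma^2 / delta^2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    a = z * z                              # chi-square quantile with 1 d.f.
    rng = random.Random(seed)
    draw = lambda: rng.gauss(0.0, sigma)
    total = sum(sequential_sample_size(draw, n1, delta, a)[0] for _ in range(reps))
    return total / reps, a * sigma**2 / delta**2

if __name__ == "__main__":
    avg_n, n_opt = average_sample_size(sigma=2.0, n1=5, delta=0.5, alpha=0.05)
    print(avg_n, n_opt)
```

The simulated average of $N(\delta)$ can be compared with the known-variance sample size, in the spirit of the bound (6.7.15).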
Example 6.1. It is assumed that the compressive strength of concrete cubes follows
a log-normal distribution, $LN(\mu, \sigma^2)$, with unknown parameters $(\mu, \sigma)$. It is
desired that in a given production process the compressive strength, $X$, will not
be smaller than $\xi_0$ in $(1 - \beta) \times 100\%$ of the concrete cubes. In other words, the
$\beta$-quantile of the parent log-normal distribution should not be smaller than $\xi_0$, where
the $\beta$-quantile of $LN(\mu, \sigma^2)$ is $x_\beta = \exp\{\mu + z_\beta \sigma\}$, and $z_\beta$ is the $\beta$-quantile of
$N(0, 1)$. We observe a sample of $n$ i.i.d. random variables $X_1, \ldots, X_n$ and should
decide on the basis of the observed sample values whether the strength requirement
is satisfied. Let $Y_i = \log X_i$ $(i = 1, \ldots, n)$. The sample mean and variance $(\bar Y_n, S_n^2)$,
where $S_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar Y_n)^2$, constitute a minimal sufficient statistic. On the
basis of $(\bar Y_n, S_n^2)$ we wish to determine a $(1 - \alpha)$-lower confidence limit, $\underline x_{\alpha,\beta}$, to the
unknown $\beta$-quantile $x_\beta$. Accordingly, $\underline x_{\alpha,\beta}$ should satisfy the relationship

$$P_{\mu,\sigma}\{\underline x_{\alpha,\beta} \le x_\beta\} \ge 1 - \alpha, \quad \text{for all } (\mu, \sigma).$$

$\underline x_{\alpha,\beta}$ is called a lower $(1 - \alpha, 1 - \beta)$ guaranteed coverage tolerance limit. If
$\underline x_{\alpha,\beta} \ge \xi_0$, we say that the production process is satisfactory (meets the specified
standard). Note that the problem of determining $\underline x_{\alpha,\beta}$ is equivalent to the problem
of determining a $(1 - \alpha)$-lower confidence limit to $\mu + z_\beta \sigma$. This lower confidence
limit is constructed in the following manner. We note first that if $U \sim N(0, 1)$, then

$$\sqrt{n}\,[\bar Y_n - (\mu + z_\beta \sigma)]/S_n \sim \frac{U + \sqrt{n}\, z_{1-\beta}}{(\chi^2[n - 1]/(n - 1))^{1/2}} \sim t[n - 1; \sqrt{n}\, z_{1-\beta}],$$

where $t[\nu; \delta]$ is the noncentral t-distribution. Thus, a $(1 - \alpha)$-lower confidence limit
for $\mu + z_\beta \sigma$ is

$$\underline\eta_{\alpha,\beta} = \bar Y_n - t_{1-\alpha}[n - 1; \sqrt{n}\, z_{1-\beta}] \frac{S_n}{\sqrt{n}},$$

and $\underline x_{\alpha,\beta} = \exp\{\underline\eta_{\alpha,\beta}\}$ is a lower $(1 - \alpha, 1 - \beta)$-tolerance limit.

Example 6.2. Let $X_1, \ldots, X_n$ be i.i.d. random variables representing the life length
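of electronic systems and distributed like $G\left(\frac{1}{\theta}, 1\right)$. We construct two different
$(1 - \alpha)$-lower confidence limits for $\theta$.

(i) The minimal sufficient statistic is $T_n = \sum X_i$. This statistic is distributed like
$\frac{\theta}{2} \chi^2[2n]$. Thus, for testing $H_0 : \theta \le \theta_0$ against $H_1 : \theta > \theta_0$ at level of significance
$\alpha$, the acceptance regions are of the form

$$A(\theta_0) = \left\{T_n : T_n \le \frac{\theta_0}{2} \chi^2_{1-\alpha}[2n]\right\}, \quad 0 < \theta_0 < \infty.$$

The corresponding confidence intervals are

$$S(T_n) = \left\{\theta : \theta \ge \frac{2T_n}{\chi^2_{1-\alpha}[2n]}\right\}.$$

The lower confidence limit for $\theta$ is, accordingly,

$$\underline\theta_\alpha = 2T_n/\chi^2_{1-\alpha}[2n].$$

(ii) Let $X_{(1)} = \min_{1 \le i \le n} \{X_i\}$. $X_{(1)}$ is distributed like $\frac{\theta}{2n} \chi^2[2]$. Hence, the hypotheses
$H_0 : \theta \le \theta_0$ against $H_1 : \theta > \theta_0$ can be tested at level $\alpha$ by the acceptance regions

$$A'(\theta_0) = \left\{X_{(1)} : X_{(1)} \le \frac{\theta_0}{2n} \chi^2_{1-\alpha}[2]\right\}, \quad 0 < \theta_0 < \infty.$$

These regions yield the confidence intervals

$$S'(X_{(1)}) = \left\{\theta : \theta \ge \frac{2nX_{(1)}}{\chi^2_{1-\alpha}[2]}\right\}.$$

The corresponding lower confidence limit is $\underline\theta'_\alpha = 2nX_{(1)}/\chi^2_{1-\alpha}[2]$. Both families
of confidence intervals provide lower confidence limits for the mean-time between
failures, $\theta$, at the same confidence level $1 - \alpha$. The question is which family is
more efficient. Note that $\underline\theta_\alpha$ is a function of the minimal sufficient statistic, while
$\underline\theta'_\alpha$ is not. The expected value of $\underline\theta_\alpha$ is $E_\theta\{\underline\theta_\alpha\} = \frac{2n\theta}{\chi^2_{1-\alpha}[2n]}$. This expected value is
approximately, as $n \to \infty$,

$$E_\theta\{\underline\theta_\alpha\} \approx \theta \Big/ \left(1 + \frac{z_{1-\alpha}}{\sqrt{n}}\right) = \theta\left(1 - \frac{z_{1-\alpha}}{\sqrt{n}} + O\left(\frac{1}{n}\right)\right), \quad \text{as } n \to \infty.$$

Thus, $E\{\underline\theta_\alpha\}$ is always smaller than $\theta$, and approaches $\theta$ as $n$ grows. On the other
hand, the expected value of $\underline\theta'_\alpha$ is

$$E_\theta\left\{\frac{2nX_{(1)}}{\chi^2_{1-\alpha}[2]}\right\} = \frac{2\theta}{\chi^2_{1-\alpha}[2]} = \frac{\theta}{-\log \alpha}.$$

This expectation is about $\theta/3$ when $\alpha = 0.05$ and $\theta/4.6$ when $\alpha = 0.01$. It does
not converge to $\theta$ as $n$ increases. Thus, $\underline\theta'_\alpha$ is an inefficient lower confidence
limit of $\theta$.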
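
The two expected values in Example 6.2 can be compared numerically. The sketch below is illustrative: it computes $\chi^2$ quantiles for even degrees of freedom only, using the closed-form c.d.f. $1 - \sum_{j<m} e^{-x/2}(x/2)^j/j!$ for $2m$ degrees of freedom together with bisection, so no statistical library is required. The function names are hypothetical.

```python
import math

def chi2_cdf_even(x, df):
    """C.d.f. of chi-square with even d.f. 2m: 1 - sum_{j<m} e^{-x/2} (x/2)^j / j!."""
    m = df // 2
    half = x / 2.0
    term, total = math.exp(-half), 0.0
    for j in range(m):
        total += term
        term *= half / (j + 1)
    return 1.0 - total

def chi2_quantile_even(prob, df, tol=1e-10):
    """Quantile by bisection; df must be a positive even integer."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, df) < prob:    # expand until the quantile is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def expected_ratio_sufficient(n, alpha):
    """E{limit (i)} / theta = 2n / chi2_{1-alpha}[2n]."""
    return 2.0 * n / chi2_quantile_even(1 - alpha, 2 * n)

def expected_ratio_minimum(alpha):
    """E{limit (ii)} / theta = 2 / chi2_{1-alpha}[2] = 1 / (-log alpha)."""
    return 2.0 / chi2_quantile_even(1 - alpha, 2)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, expected_ratio_sufficient(n, 0.05), expected_ratio_minimum(0.05))
```

The first ratio increases toward 1 with $n$, while the second stays fixed at $1/(-\log\alpha)$ regardless of the sample size.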
Example 6.3.
A. Let $X_1, \ldots, X_n$ be i.i.d. $N(0, \sigma^2)$ random variables. We would like to construct
the UMA $(1 - \alpha)$-lower confidence limit of $\sigma^2$. The minimal sufficient statistic
is $T_n = \sum X_i^2$, which is distributed like $\sigma^2 \chi^2[n]$. The UMP test of size $\alpha$ of
$H_0 : \sigma^2 \le \sigma_0^2$ against $H_1 : \sigma^2 > \sigma_0^2$ is

$$\phi^0(T_n) = I\{T_n \ge \sigma_0^2 \chi^2_{1-\alpha}[n]\}.$$

Accordingly, the UMA $(1 - \alpha)$-lower confidence limit $\underline\sigma^2_\alpha$ is

$$\underline\sigma^2_\alpha = T_n/\chi^2_{1-\alpha}[n].$$

B. Let $X \sim B(n, \theta)$, $0 < \theta < 1$. We determine the UMA $(1 - \alpha)$-lower confidence
limit of the success probability $\theta$. In (2.2.4), we expressed the c.d.f.
of $B(n, \theta)$ in terms of the incomplete beta function ratio. Let $R$ be a random
number in $(0, 1)$, independent of $X$; then $\underline\theta_\alpha$ is the root of the equation

$$R \, I_{1-\theta}(n - X, X + 1) + (1 - R) \, I_{1-\theta}(n - X + 1, X) = 1 - \alpha,$$

provided $1 \le X \le n - 1$. If $X = 0$, the lower confidence limit is $\underline\theta_\alpha(0) = 0$.
When $X = n$ the lower confidence limit is $\underline\theta_\alpha(n) = \alpha^{1/n}$. By employing the
relationship between the central F-distribution and the beta distribution (see
Section 2.14), we obtain the following for $X \ge 1$ and $R = 1$:


$$\underline\theta_\alpha = \frac{X}{X + (n - X + 1) F_{1-\alpha}[2(n - X + 1), 2X]}. \tag{6.3.11}$$


If $X \ge 1$ and $R = 0$ the lower limit, $\underline\theta'_\alpha$, is obtained from (6.3.11) by substituting
$(X - 1)$ for $X$. Generally, the lower limit can be obtained as the average
$R\underline\theta_\alpha + (1 - R)\underline\theta'_\alpha$. In practice, the nonrandomized solution (6.3.11) is often
applied.

Example 6.4. Let $X$ and $Y$ be independent random variables having the normal
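distribution $N(0, \sigma^2)$ and $N(0, \rho\sigma^2)$, respectively. We can readily prove that

$$\psi(\sigma^2, \rho) = P_{\sigma^2,\rho}\{X^2 + Y^2 \le 1\} = 1 - E\left\{P\left(J; \frac{1}{2\sigma^2}\right)\right\},$$

where $J$ has the negative binomial distribution $NB\left(1 - \frac{1}{\rho}, \frac{1}{2}\right)$. $P(j; \lambda)$ designates
the c.d.f. of the Poisson distributions with mean $\lambda$. $\psi(\sigma^2, \rho)$ is the coverage probability
of a circle of radius one. We wish to determine a $(1 - \alpha)$-lower confidence limit for
$\psi(\sigma^2, \rho)$, on the basis of $n$ independent vectors, $(X_1, Y_1), \ldots, (X_n, Y_n)$, when $\rho$
is known. The minimal sufficient statistic is $T_{2n} = \sum X_i^2 + \frac{1}{\rho} \sum Y_i^2$. This statistic is
distributed like $\sigma^2 \chi^2[2n]$. Thus, the UMA $(1 - \alpha)$-upper confidence limit for $\sigma^2$ is

$$\bar\sigma^2_\alpha = T_{2n}/\chi^2_\alpha[2n].$$

The Poisson family is an MLR one. Hence, by Karlin's Lemma, the c.d.f. $P(j; 1/2\sigma^2)$
is an increasing function of $\sigma^2$ for each $j = 0, 1, \ldots$. Accordingly, if $\sigma^2 \le \bar\sigma^2_\alpha$ then
$P(j; 1/2\sigma^2) \le P(j; 1/2\bar\sigma^2_\alpha)$. It follows that $E\left\{P\left(J; \frac{1}{2\sigma^2}\right)\right\} \le E\left\{P\left(J; \frac{1}{2\bar\sigma^2_\alpha}\right)\right\}$.
From this relationship we infer that

$$\psi(\bar\sigma^2_\alpha, \rho) = 1 - E\left\{P\left(J; \frac{1}{2\bar\sigma^2_\alpha}\right)\right\}$$

is a $(1 - \alpha)$-lower confidence limit for $\psi(\sigma^2, \rho)$. We show now that $\psi(\bar\sigma^2_\alpha, \rho)$ is a
UMA lower confidence limit. By negation, if $\psi(\bar\sigma^2_\alpha, \rho)$ is not a UMA, there exists
another $(1 - \alpha)$ lower confidence limit, $\hat\psi_\alpha$ say, and some $0 < \psi' < \psi(\sigma^2, \rho)$ such
that

$$P\{\psi(\bar\sigma^2_\alpha, \rho) \le \psi'\} > P\{\hat\psi_\alpha \le \psi'\}.$$

The function $P\left(j; \frac{1}{2\sigma^2}\right)$ is a strictly increasing function of $\sigma^2$. Hence, for each $\rho$
there is a unique inverse $\sigma^2_\rho(\psi)$ for $\psi(\sigma^2, \rho)$. Thus, we obtain that

$$P_{\sigma^2}\{\bar\sigma^2_\alpha \ge \sigma^2_\rho(\psi')\} > P_{\sigma^2}\{\sigma^2_\rho(\hat\psi_\alpha) \ge \sigma^2_\rho(\psi')\},$$

where $\sigma^2_\rho(\psi') > \sigma^2$. Accordingly, $\sigma^2_\rho(\hat\psi_\alpha)$ is a $(1 - \alpha)$-upper confidence limit for $\sigma^2$.
But then the above inequality contradicts the assumption that $\bar\sigma^2_\alpha$ is UMA.

Example 6.5. Let $X_1, \ldots, X_n$ be i.i.d. random variables distributed like $N(\mu, \sigma^2)$.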
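The UMP-unbiased test of the hypotheses

$$H_0 : \mu = \mu_0, \ \sigma^2 \text{ arbitrary}$$

against

$$H_1 : \mu \ne \mu_0, \ \sigma^2 \text{ arbitrary}$$

is the t-test

$$\phi^0(\bar X, S) = \begin{cases} 1, & \text{if } \dfrac{|\bar X - \mu_0| \sqrt{n}}{S} > t_{1-\alpha/2}[n - 1], \\ 0, & \text{otherwise,} \end{cases}$$

where $\bar X$ and $S$ are the sample mean and standard deviation, respectively. Correspondingly,
the confidence interval

$$\left(\bar X - t_{1-\alpha/2}[n - 1] \frac{S}{\sqrt{n}}, \ \bar X + t_{1-\alpha/2}[n - 1] \frac{S}{\sqrt{n}}\right)$$

is a UMAU at level $(1 - \alpha)$.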


Example 6.6. In Example 4.11, we discussed the problem of comparing the binomial
experiments in two clinics at which standard treatment is compared with a new (test)
treatment. If $X_{ij}$ designates the number of successes in the $j$th sample at the $i$th clinic
$(i = 1, 2; j = 1, 2)$, we assumed that $X_{ij}$ are independent and $X_{ij} \sim B(n, \theta_{ij})$. We
consider the cross-product ratio

$$\rho = \frac{\theta_{11}(1 - \theta_{12})}{(1 - \theta_{11})\theta_{12}} \Big/ \frac{\theta_{21}(1 - \theta_{22})}{(1 - \theta_{21})\theta_{22}}.$$

In Example 4.11, we developed the UMPU test of the hypothesis $H_0 : \rho = 1$ against
$H_1 : \rho \ne 1$. On the basis of this UMPU test, we can construct the UMAU confidence
limits of $\rho$.

Let $Y = X_{11}$, $T_1 = X_{11} + X_{12}$, $T_2 = X_{21} + X_{22}$, and $S = X_{11} + X_{21}$. The conditional
p.d.f. of $Y$ given $(T_1, T_2, S)$ under $\rho$ was given in Example 4.11. Let
$H(y \mid T_1, T_2, S)$ denote the corresponding conditional c.d.f. This family of conditional
distributions is MLR in $Y$. Thus, the quantiles of the distributions are increasing
functions of $\rho$. Similarly, $H(y \mid T_1, T_2, S)$ are strictly decreasing functions of $\rho$ for
each $y = 0, 1, \ldots, \min(T_1, S)$ and each $(T_1, T_2, S)$.

As shown earlier, one-sided UMA confidence limits require in discrete cases
further randomization. Thus, we have to draw at random two numbers $R_1$ and $R_2$
independently from a rectangular distribution $R(0, 1)$ and solve simultaneously the
equations

$$R_1 H(Y; T_1, T_2, S, \underline\rho) + (1 - R_1) H(Y - 1; T_1, T_2, S, \underline\rho) = 1 - \epsilon_1,$$
$$R_2 H(Y - 1; T_1, T_2, S, \bar\rho) + (1 - R_2) H(Y; T_1, T_2, S, \bar\rho) = \epsilon_2,$$

where $\epsilon_1 + \epsilon_2 = \alpha$. Moreover, in order to obtain UMA unbiased intervals we have
to determine $\underline\rho$, $\bar\rho$, $\epsilon_1$ and $\epsilon_2$ so that the two conditions of (4.4.2) will be satisfied
simultaneously. One can write a computer algorithm to obtain this objective. However,
the computations may be lengthy and tedious. If $T_1$, $T_2$ and $S$ are not too small we
can approximate the UMAU limits by the roots of the equations

$$H(Y; T_1, T_2, S, \underline\rho) = 1 - \alpha/2,$$
$$H(Y; T_1, T_2, S, \bar\rho) = \alpha/2.$$

These equations have unique roots since the c.d.f. $H(Y; T_1, T_2, S, \rho)$ is a strictly
decreasing function of $\rho$ for each $(Y, T_1, T_2, S)$ having a continuous partial derivative
with respect to $\rho$. The roots $\underline\rho$ and $\bar\rho$ of the above equations are generally the
ones used in applications. However, they are not UMAU. In Table 6.1, we present
a few cases numerically. The confidence limits in Table 6.1 were computed by
determining first the large sample approximate confidence limits (see Section 7.4)
and then correcting the limits by employing the monotonicity of the conditional c.d.f.
$H(Y; T_1, T_2, S, \rho)$ in $\rho$. The limits are determined by a numerical search technique on
a computer.

Table 6.1 0.95-Confidence Limits for the Cross-Product Ratio

n_1   N_1   n_2   N_2   Y    T_1   T_2   S    lower limit   upper limit
32    112   78    154   5    15    17    18   0.1103        2.4057
20    40    20    40    5    20    30    20   0.0303        1.2787
25    50    25    50    15   25    27    22   5.8407        169.4280
20    50    20    50    15   25    27    22   5.6688        164.2365
40    75    30    80    33   43    25    48   0.9049        16.2156


Example 6.7. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a negative-binomial
distribution $NB(\psi, \nu)$, where $\nu$ is known and $0 < \psi < 1$. A minimal sufficient
statistic is $T_n = \sum_{i=1}^n X_i$, which has the negative-binomial distribution $NB(\psi, n\nu)$.

Consider the $\beta$-content one-sided prediction interval $[0, G^{-1}(\beta; \psi, \nu)]$, where
$G^{-1}(p; \psi, \nu)$ is the $p$th quantile of $NB(\psi, \nu)$. The c.d.f. of the negative-binomial
distribution is related to the incomplete beta function ratio according to formula
(2.2.12), i.e.,

$$G(x; \psi, \nu) = I_{1-\psi}(\nu, x + 1), \quad x = 0, 1, \ldots.$$

The $p$th quantile of the $NB(\psi, \nu)$ can thus be defined as

$$G^{-1}(p; \psi, \nu) = \text{least nonnegative integer } j \text{ such that } I_{1-\psi}(\nu, j + 1) \geq p.$$

This function is nondecreasing in $\psi$ for each $p$ and $\nu$. Indeed, $\mathcal{F} = \{NB(\psi, \nu); 0 < \psi < 1\}$
is an MLR family. Furthermore, since $T_n \sim NB(\psi, n\nu)$, we can obtain a UMA
upper confidence limit for $\psi$, $\bar\psi_\alpha$, at confidence level $\gamma = 1 - \alpha$. A nonrandomized
upper confidence limit is the root $\bar\psi_\alpha$ of the equation

$$I_{1-\bar\psi_\alpha}(n\nu, T_n + 1) = \alpha.$$

If we denote by $B^{-1}(p; a, b)$ the $p$th quantile of the beta distribution $B(a, b)$ then $\bar\psi_\alpha$
is given accordingly by

$$\bar\psi_\alpha = 1 - B^{-1}(\alpha; n\nu, T_n + 1).$$

The $p$-content $(1 - \alpha)$-level tolerance interval is, therefore, $[0, G^{-1}(p; \bar\psi_\alpha, \nu)]$.


Example 6.8. In statistical life testing families of increasing failure rate (IFR) are
often considered. The hazard or failure rate function $h(x)$ corresponding to an
absolutely continuous distribution $F(x)$ is defined as

$$h(x) = f(x)/(1 - F(x)),$$

where $f(x)$ is the p.d.f. A distribution function $F(x)$ is IFR if $h(x)$ is a nondecreasing
function of $x$. The function $F(x)$ is differentiable almost everywhere. Hence, the
failure rate function $h(x)$ can be written (for almost all $x$) as

$$h(x) = -\frac{d}{dx} \log(1 - F(x)).$$

Thus, if $F(x)$ is an IFR distribution, $-\log(1 - F(x))$ is a convex function of $x$. A
distribution function $F(x)$ is called log-convex if its logarithm is a convex function
of $x$. The tolerance limits that will be developed in the present example will be
applicable for any log-convex distribution function.

Let $X_{(1)} \leq \cdots \leq X_{(n)}$ be the order statistic. It is instructive to derive first a
$(p, 1 - \alpha)$-lower tolerance limit for the simple case of the exponential distribution
$G\left(\frac{1}{\theta}, 1\right)$, $0 < \theta < \infty$. The $p$th quantile of $G\left(\frac{1}{\theta}, 1\right)$ is

$$F^{-1}(p; \theta) = -\theta \log(1 - p), \quad 0 < p < 1.$$

Let $T_{n,r} = \sum_{i=1}^{r} (n - i + 1)(X_{(i)} - X_{(i-1)})$, with $X_{(0)} \equiv 0$, be the total life until the
$r$th failure. $T_{n,r}$ is distributed like $\frac{\theta}{2}\chi^2[2r]$. Hence, the UMA $(1 - \alpha)$-lower
confidence limit for $\theta$ is

$$\underline{\theta}_\alpha = \frac{2 T_{n,r}}{\chi^2_{1-\alpha}[2r]}.$$

The corresponding $(p, 1 - \alpha)$ lower tolerance limit is

$$L_{p,\alpha}(T_{n,r}) = -\log(1 - p)\, \frac{2 T_{n,r}}{\chi^2_{1-\alpha}[2r]}.$$
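The computation in Example 6.8 can be sketched numerically. The following is a minimal illustration, not part of the original text: it assumes the tolerance limit is the quantile function $-\theta\log(1-p)$ evaluated at the lower confidence limit $2T_{n,r}/\chi^2_{1-\alpha}[2r]$, and it uses the closed-form c.d.f. of a chi-square variable with an even number of degrees of freedom, so only the standard library is needed. All function names are hypothetical.

```python
import math

def chi2_cdf_even(x, r):
    """C.d.f. of chi-square[2r], using P{chi2[2r] <= x} = 1 - sum_{j<r} e^{-x/2}(x/2)^j / j!."""
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0          # j = 0 term (exponential factor applied at the end)
    for j in range(1, r):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(p, r, tol=1e-12):
    """p-th quantile of chi-square[2r], found by bisection on the c.d.f."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, r) < p:
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, r) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def total_life(first_r_ordered, n):
    """T_{n,r} = sum_{i=1}^r (n - i + 1)(X_(i) - X_(i-1)), with X_(0) = 0."""
    prev, total = 0.0, 0.0
    for i, x in enumerate(first_r_ordered, start=1):
        total += (n - i + 1) * (x - prev)
        prev = x
    return total

def lower_tolerance_limit(first_r_ordered, n, p, alpha):
    """Assumed form: -log(1 - p) * 2 T_{n,r} / chi2_{1-alpha}[2r]."""
    r = len(first_r_ordered)
    theta_lower = 2.0 * total_life(first_r_ordered, n) / chi2_quantile_even(1.0 - alpha, r)
    return -math.log(1.0 - p) * theta_lower
```

Here `first_r_ordered` holds the $r$ smallest observed failure times out of $n$ units on test, in increasing order.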


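Example 6.9. The MLE of $\sigma$ in samples from normal distributions is asymptotically
normal with mean $\sigma$ and variance $\sigma^2/2n$. Therefore, in large samples,

$$P_{\mu,\sigma}\left\{ n\left(\frac{\bar X - \mu}{\sigma}\right)^2 + 2n\left(\frac{S - \sigma}{\sigma}\right)^2 \leq \chi^2_{1-\alpha}[2] \right\} \approx 1 - \alpha,$$

for all $\mu$, $\sigma$. The region given by

$$C_\alpha(\bar X, S) = \left\{ (\mu, \sigma) : n\left(\frac{\bar X - \mu}{\sigma}\right)^2 + 2n\left(\frac{S - \sigma}{\sigma}\right)^2 \leq \chi^2_{1-\alpha}[2] \right\}$$

is a simultaneous confidence region with coverage probability approximately $(1 - \alpha)$.
The points in the region $C_\alpha(\bar X, S)$ satisfy the inequality

$$|\bar X - \mu| \leq \left[ \frac{\sigma^2 \chi^2_{1-\alpha}[2]}{n} - 2(S - \sigma)^2 \right]^{1/2}.$$

Hence, the values of $\sigma$ in the region are only those for which the square root on the
RHS of the above is real. Or, for all $n > \chi^2_{1-\alpha}[2]/2$,

$$\frac{S}{1 + \left(\dfrac{\chi^2_{1-\alpha}[2]}{2n}\right)^{1/2}} \leq \sigma \leq \frac{S}{1 - \left(\dfrac{\chi^2_{1-\alpha}[2]}{2n}\right)^{1/2}}.$$

Note that this interval is not symmetric around $S$. Let $\underline{\sigma}_n$ and $\bar\sigma_n$ denote the lower and
upper limits of the $\sigma$ interval. For each $\sigma$ within this interval we determine a $\mu$ interval
symmetrically around $\bar X$, as specified above. Consider the linear combination
$\lambda = a_1\mu + a_2\sigma$, where $a_1 + a_2 = 1$. We can obtain a $(1 - \alpha)$-level confidence interval for
$\lambda$ from the region $C_\alpha(\bar X, S)$ by determining two lines parallel to $a_1\mu + a_2\sigma = 0$ and
tangential to the confidence region $C_\alpha(\bar X, S)$. These lines are given by the formulas
$a_1\mu + a_2\sigma = \underline{\lambda}_\alpha$ and $a_1\mu + a_2\sigma = \bar\lambda_\alpha$. The confidence interval is $(\underline{\lambda}_\alpha, \bar\lambda_\alpha)$. This
interval can be obtained geometrically by projecting $C_\alpha(\bar X, S)$ onto the line $l$ spanned
by $(a_1, a_2)$; i.e., $l = \{(\rho a_1, \rho a_2); -\infty < \rho < \infty\}$.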
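The $\sigma$-limits of the region in Example 6.9 are easy to evaluate, because the chi-square distribution with 2 degrees of freedom is exponential and so $\chi^2_{1-\alpha}[2] = -2\log\alpha$. The sketch below is an illustration under that identity only; the function names are hypothetical, and the $\mu$ half-width is assumed to be $[\sigma^2\chi^2_{1-\alpha}[2]/n - 2(S-\sigma)^2]^{1/2}$.

```python
import math

def sigma_limits(s, n, alpha):
    """Range of sigma values inside the approximate joint region for (mu, sigma)."""
    c = -2.0 * math.log(alpha)          # chi-square[2] quantile of order 1 - alpha
    k = math.sqrt(c / (2.0 * n))
    if k >= 1.0:
        raise ValueError("the region is bounded in sigma only if n > chi2_{1-alpha}[2] / 2")
    return s / (1.0 + k), s / (1.0 - k)

def mu_half_width(s, n, alpha, sigma):
    """Half-width of the mu-interval around the sample mean at a given sigma."""
    c = -2.0 * math.log(alpha)
    inside = sigma ** 2 * c / n - 2.0 * (s - sigma) ** 2
    return math.sqrt(max(inside, 0.0))  # guard against tiny negative rounding error
```

At either end of the $\sigma$ range the half-width shrinks to zero, which is why the region is closed; the asymmetry around $S$ shows up as a longer upper arm.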
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Section 6.2 

6.2.1 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common exponential
distribution, $G\left(\frac{1}{\theta}, 1\right)$, $0 < \theta < \infty$. Determine a $(1 - \alpha)$-upper confidence limit
for $\delta = e^{-\theta}$.

6.2.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common Poisson
distribution $P(\lambda)$, $0 < \lambda < \infty$. Determine a two-sided confidence interval
for $\lambda$, at level $1 - \alpha$. [Hint: Let $T_n = \sum X_i$. Apply the relationship
$P_\lambda\{T_n \leq t\} = P\{\chi^2[2t + 2] \geq 2n\lambda\}$, $t = 0, 1, \ldots$ to show that $(\underline{\lambda}_\alpha, \bar\lambda_\alpha)$ is
a $(1 - \alpha)$-level confidence interval, where $\underline{\lambda}_\alpha = \frac{1}{2n}\chi^2_{\alpha/2}[2T_n + 2]$ and
$\bar\lambda_\alpha = \frac{1}{2n}\chi^2_{1-\alpha/2}[2T_n + 2]$.]

6.2.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables distributed like $G(\lambda, 1)$, $0 < \lambda < \infty$;
and let $Y_1, \ldots, Y_m$ be i.i.d. random variables distributed like $G(\eta, 1)$,
$0 < \eta < \infty$. The X-variables and the Y-variables are independent. Determine
a $(1 - \alpha)$-upper confidence limit for $\theta = (1 + \eta/\lambda)^{-1}$ based on the statistic


m 


ye 
ie =) 


6.2.4 Consider a vector $X$ of $n$ equicorrelated normal random variables, having zero
mean, $\mu = 0$, and variance $\sigma^2$ [Problem 1, Section 5.3]; i.e., $X \sim N(0, \Sigma)$,
where $\Sigma = \sigma^2(1 - \rho)I + \sigma^2\rho J$; $0 < \sigma^2 < \infty$, $-\frac{1}{n-1} < \rho < 1$. Construct a
$(1 - \alpha)$-level confidence interval for $\rho$. [Hint:
(i) Make the transformation $Y = HX$, where $H$ is a Helmert orthogonal
matrix;
(ii) Consider the distribution of $Y_1^2 \big/ \sum_{i=2}^{n} Y_i^2$.]

6.2.5 Consider the linear regression model

$$Y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \ldots, n,$$

where $e_1, \ldots, e_n$ are i.i.d. $N(0, \sigma^2)$, and $x_1, \ldots, x_n$ are specified constants such that
$\sum (x_i - \bar x)^2 > 0$. Determine the formulas of $(1 - \alpha)$-level confidence limits
for $\beta_0$, $\beta_1$, and $\sigma^2$. To what tests of significance do these confidence intervals
correspond?

6.2.6 Let $X$ and $Y$ be independent, normally distributed random variables,
$X \sim N(\xi, \sigma_1^2)$ and $Y \sim N(\eta, \sigma_2^2)$; $-\infty < \xi < \infty$, $0 < \eta < \infty$, $\sigma_1$ and $\sigma_2$ known.
Let $\delta = \xi/\eta$. Construct a $(1 - \alpha)$-level confidence interval for $\delta$.


Section 6.3 


6.3.1 Prove that if an upper (lower) confidence limit for a real parameter $\theta$ is based
on a UMP test of $H_0: \theta \geq \theta_0$ ($\theta \leq \theta_0$) against $H_1: \theta < \theta_0$ ($\theta > \theta_0$) then the
confidence limit is UMA.

6.3.2 Let $X_1, \ldots, X_n$ be i.i.d. having a common two-parameter exponential
distribution, i.e., $X \sim \mu + G\left(\frac{1}{\beta}, 1\right)$; $-\infty < \mu < \infty$, $0 < \beta < \infty$.
(i) Determine the $(1 - \alpha)$-level UMAU lower confidence limit for $\mu$.
(ii) Determine the $(1 - \alpha)$-level UMAU lower confidence limit for $\beta$.
[Hint: See Problem 1, Section 4.5.]

6.3.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common rectangular
distribution $R(0, \theta)$; $0 < \theta < \infty$. Determine the $(1 - \alpha)$-level UMA lower
confidence limit for $\theta$.

6.3.4 Consider the random effect model, Model II, of ANOVA (Example 3.9).
Derive the $(1 - \alpha)$-level confidence limits for $\sigma^2$ and $\tau^2$. Does this system of
confidence intervals have optimal properties?


Section 6.4 


6.4.1 


Let X,,..., X, bei.i.d. random variables having a Poisson distribution P(A), 
0 <A < ow. Determine a (p, 1 — a) guaranteed coverage upper tolerance 
limit for X. 
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6.4.2 


Consider the normal simple regression model (Problem 7, Section 5.4). 

Let € be a point in the range of controlled experimental levels x1,..., xy 

(regressors). A p-content prediction limit at € is the point n, = Bo + Bié + 

ZpO. 

(i) Determine a (p, | — a) guaranteed upper tolerance limit at &, i.e., deter- 
mine /, .(&) so that 


P, <b +B ] .f1 ,€-E\"? 
1 | + Bs BPE eT Binet (; * “Spx ) | 


=1-a, forall 6 = (fo, fi, 0). 


(ii) What is the form of the asymptotic (p, 1 — a)-level upper tolerance limit? 


Section 6.5 


6.5.1 


6.5.2 


6.5.3 


6.5.4 


Consider a symmetric continuous distribution F(x — 4), —oO <  < OO. 
How large should the sample size n be so that (Xi), X(n—i+1)) is a distribution- 
free confidence interval for ju, at level | — a = 0.95, when 


(i) i = 1, (i) i = 2, and (ii) i = 3. 


Apply the large sample normal approximation to the binomial distribution 
to show that for large size random samples from symmetric distribution the 
(1 — @)-level distribution free confidence interval for the median is given by 
(Xw, X (n—i+1))s where i = [S = +/n Z1—al (David, 1970, p. 14). 


How large should the sample size n be so that a (p, y) upper tolerance limit 
will exist with p = 0.95 and y = 0.95? 


Let F(x) be a continuous c.d.f. and Xj) < +--+ < X(,) the order statistic of 
a random sample from such a distribution. Let F~'(p) and F ~h(@), with 
0 < p <q <1, be the pth and qth quantiles of this distribution. Consider 
the interval Eg = (F-'(p), F-'(q)). Let p <r <s <n. Show that 


V = P{Epg C (Xm, Xo} 
s—r-1 


n! ; pri ; 
ei baa PN EU stli,s—r-— jf). 
j=0 


Ifg = 1— 6/2 and p = 6/2 then (X(,), X(s)) is: a1 — B, y) tolerance inter- 
val, where y is given by the above formula. 
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Section 6.6 


6.6.1 In a one-way ANOVA $k = 10$ samples were compared. Each of the samples
consisted of $n = 10$ observations. The sample means in order of magnitude
were: 15.5, 17.5, 20.2, 23.3, 24.1, 25.5, 28.8, 28.9, 30.1, 30.5. The pooled
variance estimate is $s_p^2 = 105.5$. Perform the Scheffé simultaneous testing to
determine which differences are significant at level $\alpha = 0.05$.

6.6.2 $n = 10$ observations $Y_{ij}$ ($i = 1, \ldots, 3$; $j = 1, \ldots, n$) were performed at three
values of $x$. The sample statistics are:

           $x_1 = 0$   $x_2 = 1.5$   $x_3 = 3.0$
$\bar Y$      5.5         9.7          17.3
SDY          13.7        15.8          14.5

(i) Determine the LSEs of $\beta_0$, $\beta_1$, and $\sigma^2$ for the model $Y_{ij} = \beta_0 + \beta_1 x_i + e_{ij}$,
where $\{e_{ij}\}$ are i.i.d. $N(0, \sigma^2)$.

(ii) Determine simultaneous confidence intervals for $E\{Y\} = \beta_0 + \beta_1 x$, for
all $0 \leq x \leq 3$, using Scheffé's S-method.


Section 6.7 


6.7.1 Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables having a common
log-normal distribution, $LN(\mu, \sigma^2)$. Consider the problem of estimating
$\xi = \exp\{\mu + \sigma^2/2\}$. The proportional-closeness of an estimator, $\hat\xi$, is defined as
$P_\theta\{|\hat\xi - \xi| < \lambda\xi\}$, where $\lambda$ is a specified positive real.
(i) Show that with a fixed sample procedure, there exists no estimator, $\hat\xi$, such
that the proportional-closeness for a specified $\lambda$ is at least $\gamma$, $0 < \gamma < 1$.
(ii) Develop a two-stage procedure so that the estimator $\hat\xi_N$ will have the
prescribed proportional-closeness.

6.7.2 Show that if $\mathcal{F}$ is a family of distribution functions depending on a location
parameter of the translation type, i.e., $F(x; \theta) = F_0(x - \theta)$, $-\infty < \theta < \infty$,
then there exists a fixed-width confidence interval estimator for $\theta$.

6.7.3 Let $X_1, \ldots, X_n$ be i.i.d. having a rectangular distribution $R(0, \theta)$, $0 < \theta < 2$.
Let $X_{(n)}$ be the sample maximum, and consider the fixed-width interval
estimator $I_\delta(X_{(n)}) = (X_{(n)}, X_{(n)} + \delta)$, $0 < \delta < 1$. How large should $n$ be so
that $P_\theta\{\theta \in I_\delta(X_{(n)})\} \geq 1 - \alpha$, for all $\theta < 2$?

6.7.4 Consider the following three-stage sampling procedure for estimating the
mean of a normal distribution. Specify a value of $\delta$, $0 < \delta < \infty$.
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(i) Take a random sample of size n,. Compute the sample variance Se If 
n> (a? /5?)S? where a? = AN terminate sampling. Otherwise, 


ny? 


add an independent sample of size 


a 2 
No = 5 | +1—-—n. 


(ii) Compute the pooled sample variance, Ss 4Np" If ny +tN> 


(a?/ 5°) $2 ,y, terminate sampling; otherwise, add 


a 
N3 = | F581 +1- (ny oF N2) 


independent observations. Let VN = n; + No + N3.Let X vy be the average 
of the sample of size N and [s(X,) = (Xn — 6, Xy +4). 

(i) Compute Py{u € I3(Xy)} for 6 = (u, 0). 

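(ii) Compute $E_\theta\{N\}$.

6.2.2 $X_1, \ldots, X_n$ are i.i.d. $P(\lambda)$. $T_n = \sum_{i=1}^{n} X_i \sim P(n\lambda)$.

$$P\{T_n \leq t\} = P(t; n\lambda) = 1 - P(G(1, t + 1) < n\lambda) = P\{\chi^2[2t + 2] > 2n\lambda\}.$$

The UMP test of $H_0: \lambda \geq \lambda_0$ against $H_1: \lambda < \lambda_0$ is $\phi(T_n) = I(T_n < t_\alpha)$.
Note that $P(\chi^2[2t_\alpha + 2] \geq 2n\lambda_0) = \alpha$ if $2n\lambda_0 = \chi^2_{1-\alpha}[2t_\alpha + 2]$. For
two-sided confidence limits, we have

$$\underline{\lambda}_\alpha = \frac{\chi^2_{\alpha/2}[2T_n + 2]}{2n} \quad \text{and} \quad \bar\lambda_\alpha = \frac{\chi^2_{1-\alpha/2}[2T_n + 2]}{2n}.$$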


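6.2.4 Without loss of generality assume that $\sigma^2 = 1$. Then

$$X \sim N(0, (1 - \rho)I + \rho J),$$

where $-\frac{1}{n-1} < \rho < 1$, $n$ is the dimension of $X$, and $J = 1_n 1_n'$.

(i) The Helmert transformation yields $Y = HX$, where

$$Y \sim N(0, H((1 - \rho)I + \rho J)H').$$

Note that $H((1 - \rho)I + \rho J)H' = \mathrm{diag}((1 - \rho) + n\rho, (1 - \rho), \ldots, (1 - \rho))$.

(ii) $Y_1^2 \sim ((1 - \rho) + n\rho)\chi_1^2[1]$ and $\sum_{j=2}^{n} Y_j^2 \sim (1 - \rho)\chi_2^2[n - 1]$, where $\chi_1^2[1]$
and $\chi_2^2[n - 1]$ are independent. Thus,

$$W = \frac{(n - 1)Y_1^2}{\sum_{j=2}^{n} Y_j^2} \sim \left(1 + \frac{n\rho}{1 - \rho}\right) F[1, n - 1].$$

Hence, for a given $0 < \alpha < 1$,

$$P\left\{ \left(1 + \frac{n\rho}{1 - \rho}\right) F_{\alpha/2}[1, n - 1] \leq W \leq \left(1 + \frac{n\rho}{1 - \rho}\right) F_{1-\alpha/2}[1, n - 1] \right\} = 1 - \alpha.$$

Recall that $F_{\alpha/2}[1, n - 1] = \dfrac{1}{F_{1-\alpha/2}[n - 1, 1]}$. Let

$$\bar R_{n,\alpha} = \frac{1}{n}\left(W F_{1-\alpha/2}[n - 1, 1] - 1\right)$$

and

$$\underline{R}_{n,\alpha} = \frac{1}{n} \cdot \frac{W - F_{1-\alpha/2}[1, n - 1]}{F_{1-\alpha/2}[1, n - 1]}.$$

Since $\rho/(1 - \rho)$ is a strictly increasing function of $\rho$, the confidence
limits for $\rho$ are

$$\text{Lower limit} = \max\left(-\frac{1}{n - 1},\ \frac{\underline{R}_{n,\alpha}}{1 + \underline{R}_{n,\alpha}}\right),$$

$$\text{Upper limit} = \frac{\bar R_{n,\alpha}}{1 + \bar R_{n,\alpha}}.$$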


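6.2.6 The method used here is known as Fieller's method. Let $U = X - \delta Y$.
Accordingly, $U \sim N(0, \sigma_1^2 + \delta^2\sigma_2^2)$ and

$$\frac{(X - \delta Y)^2}{\sigma_1^2 + \delta^2\sigma_2^2} \sim \chi^2[1].$$

It follows that the confidence limits are the two real roots (if they exist) of the quadratic equation
in $\delta$,

$$\delta^2(Y^2 - \chi^2_{1-\alpha}[1]\sigma_2^2) - 2\delta XY + (X^2 - \chi^2_{1-\alpha}[1]\sigma_1^2) = 0.$$

These roots are given by

$$\delta_{1,2} = \frac{XY}{Y^2 - \sigma_2^2\chi^2_{1-\alpha}[1]} \pm \frac{|Y|\sigma_1 (\chi^2_{1-\alpha}[1])^{1/2}}{Y^2 - \sigma_2^2\chi^2_{1-\alpha}[1]} \left[ 1 + \frac{\sigma_2^2}{\sigma_1^2}\cdot\frac{X^2}{Y^2} - \frac{\sigma_2^2\chi^2_{1-\alpha}[1]}{Y^2} \right]^{1/2}.$$

It follows that if $Y^2 > \sigma_2^2\chi^2_{1-\alpha}[1]$ the two real roots exist. These are the
confidence limits for $\delta$.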
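Fieller's limits for $\delta = \xi/\eta$ can be checked numerically. The sketch below is an illustration only (the function name and the convention of returning `None` are assumptions): it solves the quadratic $\delta^2(Y^2 - c\sigma_2^2) - 2\delta XY + (X^2 - c\sigma_1^2) = 0$ with $c = \chi^2_{1-\alpha}[1] = z^2_{1-\alpha/2}$, and reports no finite interval when $Y^2 \leq c\sigma_2^2$.

```python
import math
from statistics import NormalDist

def fieller_limits(x, y, sigma1, sigma2, alpha):
    """Roots of the Fieller quadratic for delta = xi / eta, with known sigma1 and sigma2."""
    c = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2   # chi-square[1] quantile of order 1 - alpha
    a = y * y - c * sigma2 ** 2                        # leading coefficient of the quadratic
    if a <= 0.0:
        return None                                    # condition Y^2 > c * sigma2^2 fails
    # Quarter-discriminant: X^2 Y^2 - (Y^2 - c s2^2)(X^2 - c s1^2), simplified.
    disc = c * (sigma1 ** 2 * y * y + sigma2 ** 2 * x * x - c * sigma1 ** 2 * sigma2 ** 2)
    root = math.sqrt(disc)
    return (x * y - root) / a, (x * y + root) / a
```

When the leading coefficient is positive the discriminant is automatically positive, so the square root is always real in the branch that returns an interval.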


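6.4.1 The m.s.s. for $\lambda$ is $T_n = \sum_{i=1}^{n} X_i$. A $p$-quantile of $P(\lambda)$ is

$$X_p(\lambda) = \min\{j \geq 0 : P(j; \lambda) \geq p\} = \min\{j \geq 0 : \chi^2_{1-p}[2j + 2] \geq 2\lambda\}.$$

Since $P(\lambda)$ is an MLR family in $X$, the $(p, 1 - \alpha)$ guaranteed upper tolerance
limit is

$$L_{\alpha,p}(T_n) = X_p(\bar\lambda_\alpha(T_n)),$$

where $\bar\lambda_\alpha(T_n) = \frac{1}{2n}\chi^2_{1-\alpha}[2T_n + 2]$ is the upper confidence limit for $\lambda$.
Accordingly,

$$L_{\alpha,p}(T_n) = \min\left\{ j \geq 0 : \frac{1}{2}\chi^2_{1-p}[2j + 2] \geq \frac{1}{2n}\chi^2_{1-\alpha}[2T_n + 2] \right\}.$$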
2 2n 


6.5.1 (i) Since $F$ is symmetric, if $i = 1$, then $j = n$. Thus, we require

$$I_{0.5}(1, n + 1) \geq 0.975,$$

where

$$I_{0.5}(1, n + 1) = \sum_{j=1}^{n+1} \binom{n+1}{j} \frac{1}{2^{n+1}} = 1 - \frac{1}{2^{n+1}} \geq 0.975.$$

Or,

$$-(n + 1)\log 2 \leq \log(0.025), \qquad n \geq \frac{-\log(0.025)}{\log(2)} - 1.$$

Hence, $n = 5$.

(ii) $I_{0.5}(2, n) = 1 - \dfrac{n + 2}{2^{n+1}} \geq 0.975$, or

$$\frac{n + 2}{2^{n+1}} \leq 0.025.$$

For $n = 8$, $\dfrac{10}{2^9} = 0.0195$. For $n = 7$, $\dfrac{9}{2^8} = 0.0352$. Thus, $n = 8$.

(iii) $I_{0.5}(3, n - 1) = 1 - \dfrac{1 + (n + 1) + n(n + 1)/2}{2^{n+1}} \geq 0.975$, or

$$\frac{2 + (n + 1)(n + 2)}{2^{n+2}} \leq 0.025.$$

For $n = 10$, we get $\dfrac{2 + 11 \times 12}{2^{12}} = 0.0327$. For $n = 11$, we get
$\dfrac{2 + 12 \times 13}{2^{13}} = 0.0193$. Thus, $n = 11$.
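The three sample sizes in the solution of 6.5.1 can be reproduced by a direct search. The sketch below follows the criterion used in that solution, namely that the lower tail $P\{B \leq i - 1\}$ of $B \sim B(n + 1, 1/2)$ must not exceed $0.025$; the function names are hypothetical, and the search simply starts at the smallest $n$ for which $i \leq n - i + 1$.

```python
from math import comb

def tail_probability(i, n):
    """P{B <= i - 1} for B ~ Binomial(n + 1, 1/2), as in the solution's formulas."""
    return sum(comb(n + 1, j) for j in range(i)) / 2 ** (n + 1)

def smallest_n(i, bound=0.025):
    """Least n for which the tail probability drops to the bound or below."""
    n = 2 * i - 1
    while tail_probability(i, n) > bound:
        n += 1
    return n

# Sample sizes for i = 1, 2, 3 under this criterion.
sizes = [smallest_n(i) for i in (1, 2, 3)]
```

Because the tail probability decreases in $n$ for fixed $i$, the first $n$ that meets the bound is the required sample size.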


6.5.2 nit (04 


losin +2-i)= >> 


ji 


For large n, by Central Limit Theorem, 
n+l 


1 
tostsn+2-1)= oo (jin +15) 


a 2 
j=i 


are i-5-(4+1)/2 Saws 
= ara > a/s2. 
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Thus, 


n 1 n 1 
j=] NY, lyon == - = ays 
I + > 5 n Z1-a/2 5 Vn z 


n 1 
st 5vnt 1Z1-a/2, 


a ee 5 
—~n 1 
= 5 tN Z 1-4/2. 


6.7.1 Let Y; = log X;, i= 1,2,.... If we have a random sample of fixed size 


a. are ~ Ir 
n, then the MLE of & is &, = exp;Y, + =6°}, where Y, = -\ OY; and 
2 ea 


ieee _ 
x2 os Y; — Y, 2 
6 nok ) 


(i) For a given A, 0 < A < 1, the proportional closeness of &,, is 


1 S 1 
PC =P {a — A)exp fa 5° | < exp {¥, + 56 | 
| 5 
<(1+A)exp u+ 50 
> l 8 2 
= P jlog1—A)< (%, — mw) + 56 , — 0°) <log(1+A)p. 
% s ‘3 = 1 a2 9: 
For large values of n, the distribution of W, = (Y, — “)+ 5 On —o-) 


a2 me) 
is approximately, by CLT, NV (0 — (: + =). 
n 
od oe 
lim P {toe —-A)<N (0 — (1 + =) < log + a} 
o?>00 n 2 


ate be log(1 + A)./n 2 log(1 — A)./n = 
ee ae fon /1+ 07/2 o/1+o2/2 = 


Hence, there exists no fixed sample procedure with PC > y > 0. 


(ii) Consider the following two-stage procedure. 
Stage I. Take a random sample of size m. Compute Y,, and Go. Let 
6 = log(1 + A). Note that 6 < —log(1 — 4). If o were known, then the 
proportional closeness would be at least y if 


2 2 o\2 
5 ayo? + 9 
—— 
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Accordingly, we define the stopping variable 


2711621 & 
y= | eee = aes 


% = 1 
If {N < m} stop sampling and use é,, = exp {Fa + 5 a | On the other 


hand, if {N > m} go to Stage II. 

Stage II. Let N2 = N — m. Draw additional N> observations, condition- 
ally independent of the initial sample. Combine the two samples and 
A ‘ = 1 
compute Yy and Gn. Stop sampling with &y = exp {Fr + ON . The 


distribution of the total sample size N,,, = max{m, N} can be determined 
in the following manner. 


we (Ny =m} = Pyo 187(1 4+62/2) < 
o2VtVin = =F o2 oP < 
m 6 o*/ ell 
_ forx?im— 1). ot? fn - 1)? més? 
-r| en a 2(m — 1)2 x20] . 
(m — 1)()/1 + 2md?/x311] — 1) 
=P)x*[m—1] < 3 
Oo 


For] =m+1,m+2,... let 


(m — 1)(,/1 + 2182/x2[1] — 1) 


o2 


Am (J) = , 


then 


P{Nm = 1} = P{am@ — 1) < x7 lm — 1] < An}. 


CHAPTER 7 


Large Sample Theory for Estimation 
and Testing 


PART I: THEORY 


We have seen in the previous chapters several examples in which the exact sampling 
distribution of an estimator or of a test statistic is difficult to obtain analytically. Large 
samples yield approximations, called asymptotic approximations, which are easy 
to derive, and whose error decreases to zero as the sample size grows. In this chapter, 
we discuss asymptotic properties of estimators and of test statistics, such as consis- 
tency, asymptotic normality, and asymptotic efficiency. In Chapter 1, we presented 
results from probability theory, which are necessary for the development of the the- 
ory of asymptotic inference. Section 7.1 is devoted to the concept of consistency 
of estimators and test statistics. Section 7.2 presents conditions for the strong con- 
sistency of the maximum likelihood estimator (MLE). Section 7.3 is devoted to the 
asymptotic normality of MLEs and discusses the notion of best asymptotically nor- 
mal (BAN) estimators. In Section 7.4, we discuss second and higher order efficiency. 
In Section 7.5, we present asymptotic confidence intervals. Section 7.6 is devoted 
to Edgeworth and saddlepoint approximations to the distribution of the MLE, in the 
one-parameter exponential case. Section 7.7 is devoted to the theory of asymptotically
efficient test statistics. Section 7.8 discusses Pitman's asymptotic efficiency
of tests.


7.1 CONSISTENCY OF ESTIMATORS AND TESTS 


Consistency of an estimator is a property that guarantees that in large samples, the
estimator yields values close to the true value of the parameter, with probability close
to one. More formally, we define consistency as follows.

Definition 7.1.1. Let $\{\hat\theta_n; n = n_0, n_0 + 1, \ldots\}$ be a sequence of estimators of a
parameter $\theta$. $\hat\theta_n$ is called consistent if $\hat\theta_n \xrightarrow{p} \theta$ as $n \to \infty$. The sequence is called
strongly consistent if $\hat\theta_n \to \theta$ almost surely (a.s.) as $n \to \infty$ for all $\theta$.


Different estimators of a parameter $\theta$ might be consistent. Among the consistent
estimators, we would prefer those having, asymptotically, the smallest mean squared error
(MSE). This is illustrated in Example 7.2.

As we shall see later, the MLE is an asymptotically most efficient estimator under
general regularity conditions.

We conclude this section by defining the consistency property for test functions. 


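Definition 7.1.2. Let $\{\phi_n\}$ be a sequence of test functions, for testing $H_0: \theta \in \Theta_0$
versus $H_1: \theta \in \Theta_1$. The sequence $\{\phi_n\}$ is called consistent if

(i) $\limsup_{n \to \infty} \sup_{\theta \in \Theta_0} E_\theta\{\phi_n(X_n)\} \leq \alpha$, $0 < \alpha < 1$,

and

(ii) $\lim_{n \to \infty} E_\theta\{\phi_n(X_n)\} = 1$, for all $\theta \in \Theta_1$.

A test function $\phi_n$ satisfying property (i) is called an asymptotically size $\alpha$ test.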

All test functions discussed in Chapter 4 are consistent. We illustrate in Example 
7.3 a test which is not based on an explicit parametric model of the distribution F(x). 
Such a test is called a distribution free test, or a nonparametric test. We show that 
the test is consistent. 

As in the case of estimation, it is not sufficient to have consistent test functions. 
One should consider asymptotically efficient tests, in a sense that will be defined 
later. 


7.2 CONSISTENCY OF THE MLE 


The question we address here is whether the MLE is consistent. We have seen in
Example 5.22 a case where the MLE is not consistent; thus, one needs conditions for
consistency of the MLE. Often we can prove the consistency of the MLE immediately,
as in the case of the MLE of $\theta = (\mu, \sigma^2)$ in the normal case, or in the Binomial and
Poisson distributions.

Let $X_1, X_2, \ldots, X_n, \ldots$ be independent identically distributed (i.i.d.) random
variables having a p.d.f. $f(x; \theta)$, $\theta \in \Theta$. Let

$$l(\theta; X_n) = \sum_{i=1}^{n} \log f(X_i; \theta).$$

If $\theta_0$ is the parameter value of the distribution of the $X$s, then from the strong law of
large numbers (SLLN)

$$\frac{1}{n}\left(l(\theta; X_n) - l(\theta_0; X_n)\right) \xrightarrow{a.s.} -I(\theta_0, \theta), \tag{7.2.1}$$

as $n \to \infty$, where $I(\theta_0, \theta)$ is the Kullback-Leibler information. Assume that
$I(\theta_0, \theta') > 0$ for all $\theta' \neq \theta_0$. Since the MLE, $\hat\theta_n$, maximizes the left-hand side of
(7.2.1) and since $I(\theta_0, \theta_0) = 0$, we can immediately conclude that if $\Theta$ contains
only a finite number of points, then the MLE is strongly consistent. This result is
generalized in the following theorem.
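The finite-parameter-set argument can be illustrated by simulation. The sketch below is an assumption-laden toy case, not taken from the text: Bernoulli observations, a three-point parameter set, and a fixed seed. It compares the averaged log-likelihood difference with $-I(\theta_0, \theta)$ and picks the maximizer over the finite set.

```python
import math
import random

def log_lik(theta, xs):
    """Bernoulli log-likelihood l(theta; X_n) for 0/1 observations."""
    s, n = sum(xs), len(xs)
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

def kl_information(theta0, theta):
    """Kullback-Leibler information I(theta0, theta) for Bernoulli distributions."""
    return (theta0 * math.log(theta0 / theta)
            + (1.0 - theta0) * math.log((1.0 - theta0) / (1.0 - theta)))

def finite_mle(thetas, xs):
    """MLE over a finite parameter set: the point with the largest log-likelihood."""
    return max(thetas, key=lambda th: log_lik(th, xs))

rng = random.Random(0)                      # fixed seed, for reproducibility only
theta0, thetas = 0.5, (0.2, 0.5, 0.8)       # illustrative true value and parameter set
xs = [1 if rng.random() < theta0 else 0 for _ in range(2000)]

for th in thetas:
    avg_diff = (log_lik(th, xs) - log_lik(theta0, xs)) / len(xs)
    print(th, round(avg_diff, 4), round(-kl_information(theta0, th), 4))
print("MLE over the finite set:", finite_mle(thetas, xs))
```

For every $\theta \neq \theta_0$ the averaged difference settles near a strictly negative limit, so the maximizer is eventually $\theta_0$ itself.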


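Theorem 7.2.1. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a p.d.f. $f(x; \theta)$,
$\theta \in \Theta$, and let $\theta_0$ be the true value of $\theta$. If

(i) $\Theta$ is compact;
(ii) $f(x; \theta)$ is upper semi-continuous in $\theta$, for all $x$;
(iii) there exists a function $K(x)$, such that $E_{\theta_0}\{|K(X)|\} < \infty$ and $\log f(x; \theta) - \log f(x; \theta_0) \leq K(x)$, for all $x$ and $\theta$;
(iv) for all $\theta \in \Theta$ and sufficiently small $\delta > 0$, $\sup_{|\phi - \theta| < \delta} f(x; \phi)$ is measurable in $x$;
(v) $f(x; \theta) = f(x; \theta_0)$ for almost all $x$ implies that $\theta = \theta_0$ (identifiability);

then the MLE $\hat\theta_n \xrightarrow{a.s.} \theta_0$, as $n \to \infty$.

The proof is outlined only. For $\delta > 0$, let $\Theta_\delta = \{\theta : |\theta - \theta_0| \geq \delta\}$. Since $\Theta$ is
compact so is $\Theta_\delta$. Let $U(X; \theta) = \log f(X; \theta) - \log f(X; \theta_0)$. The conditions of the
theorem imply (see Ferguson, 1996, p. 109) that

$$P_{\theta_0}\left\{ \limsup_{n \to \infty} \sup_{\theta \in \Theta_\delta} \frac{1}{n}\sum_{i=1}^{n} U(X_i; \theta) \leq \sup_{\theta \in \Theta_\delta} \mu(\theta) \right\} = 1,$$

where $\mu(\theta) = -I(\theta_0, \theta) < 0$, for all $\theta \in \Theta_\delta$. Thus, with probability one, for $n$
sufficiently large,

$$\sup_{\theta \in \Theta_\delta} \frac{1}{n}\sum_{i=1}^{n} U(X_i; \theta) \leq \sup_{\theta \in \Theta_\delta} \mu(\theta) < 0.$$

But,

$$\frac{1}{n}\sum_{i=1}^{n} U(X_i; \hat\theta_n) = \sup_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} U(X_i; \theta) \geq 0.$$

Thus, with probability one, for $n$ sufficiently large, $|\hat\theta_n - \theta_0| < \delta$. This demonstrates
the consistency of the MLE.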

For consistency theorems that require weaker conditions, see Pitman (1979, Ch. 8). 
For additional reading, see Huber (1967), Le Cam (1986), and Schervish (1995, 
p. 415). 


7.3 ASYMPTOTIC NORMALITY AND EFFICIENCY 
OF CONSISTENT ESTIMATORS 


The presentation of concepts and theory is done in terms of a real parameter $\theta$. The
results are generalized to the $k$-parameter case in a straightforward manner.

A consistent estimator $\hat\theta(X_n)$ of $\theta$ is called asymptotically normal if there exists
an increasing sequence $\{c_n\}$, $c_n \nearrow \infty$ as $n \to \infty$, so that

$$c_n(\hat\theta(X_n) - \theta) \xrightarrow{d} N(0, v^2(\theta)), \quad \text{as } n \to \infty. \tag{7.3.1}$$

The function $AV\{\hat\theta_n\} = v^2(\theta)/c_n^2$ is called the asymptotic variance of $\hat\theta(X_n)$. Let
$S(X_n; \theta) = \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i; \theta)$ be the score function and $I(\theta)$ the Fisher information.

An estimator $\hat\theta_n$ that, under the Cramér-Rao (CR) regularity conditions, satisfies

$$\sqrt{n}(\hat\theta_n - \theta) = \frac{1}{I(\theta)\sqrt{n}} S(X_n; \theta) + o_p(1), \tag{7.3.2}$$

as $n \to \infty$, is called asymptotically efficient. Recall that, by the Central Limit
Theorem (CLT), $\frac{1}{\sqrt{n}} S(X_n; \theta) \xrightarrow{d} N(0, I(\theta))$. Thus, efficient estimators satisfying
(7.3.2) have the asymptotic property that

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N\left(0, \frac{1}{I(\theta)}\right), \quad \text{as } n \to \infty. \tag{7.3.3}$$

For this reason, such asymptotically efficient estimators are also called BAN estimators.

We show now a set of conditions under which the MLE $\hat\theta_n$ is a BAN estimator.

In Example 1.24, we considered a sequence $\{X_n\}$ of i.i.d. random variables,
with $X_1 \sim B(1, \theta)$, $0 < \theta < 1$. In this case, $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a strongly consistent
estimator of $\theta$. The variance stabilizing transformation $g(\bar X_n) = 2\sin^{-1}\sqrt{\bar X_n}$ is a
(strongly) consistent estimator of $\omega = 2\sin^{-1}\sqrt{\theta}$. This estimator is asymptotically
normal with an asymptotic variance $AV\{g(\bar X_n)\} = \frac{1}{n}$ for all $\omega$.

Although consistent estimators satisfying (7.3.3) are called BAN, one can sometimes
construct asymptotically normal consistent estimators, which at some $\theta$ values
have an asymptotic variance with $v^2(\theta) < \frac{1}{I(\theta)}$. Such estimators are called super
efficient. In Example 7.5, we illustrate such an estimator.

Le Cam (1953) proved that the set of points on which $v^2(\theta) < \frac{1}{I(\theta)}$ has
Lebesgue measure zero, as in Example 7.5.

The following are sufficient conditions for a consistent MLE to be a BAN estimator.

C.1. The CR regularity conditions hold (see Theorem 5.2.2);

C.2. $\frac{\partial}{\partial\theta} S(X_n; \theta)$ is continuous in $\theta$, a.s.;

C.3. $\hat\theta_n$ exists, and $S(X_n; \hat\theta_n) = 0$ with probability greater than $1 - \delta$, $\delta > 0$ arbitrary,
for $n$ sufficiently large;

C.4. $\frac{1}{n} \cdot \frac{\partial}{\partial\theta} S(X_n; \hat\theta_n) \xrightarrow{p} -I(\theta)$, as $n \to \infty$.


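Theorem 7.3.1 (Asymptotic efficiency of MLE). Let $\hat\theta_n$ be an MLE of $\theta$. Then, under
conditions C.1-C.4,

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N(0, 1/I(\theta)), \quad \text{as } n \to \infty.$$

Sketch of the Proof. Let $B_{\delta,\theta,n}$ be a Borel set in $\mathcal{B}^n$ such that, for all $X_n \in B_{\delta,\theta,n}$,
$\hat\theta_n$ exists and $S(X_n; \hat\theta_n) = 0$. Moreover, $P_\theta(B_{\delta,\theta,n}) \geq 1 - \delta$. For $X_n \in B_{\delta,\theta,n}$, consider
the expansion

$$S(X_n; \hat\theta_n) = S(X_n; \theta) + (\hat\theta_n - \theta) \cdot \frac{\partial}{\partial\theta} S(X_n; \theta_n^*),$$

where $|\theta_n^* - \theta| \leq |\hat\theta_n - \theta|$.

According to conditions (iii)-(v) in Theorem 7.2.1, and Slutsky's Theorem,

$$\sqrt{n}(\hat\theta_n - \theta) = -\frac{\frac{1}{\sqrt{n}} S(X_n; \theta)}{\frac{1}{n} \cdot \frac{\partial}{\partial\theta} S(X_n; \theta_n^*)} \xrightarrow{d} N(0, 1/I(\theta)),$$

as $n \to \infty$, since by the CLT

$$\frac{1}{\sqrt{n}} S(X_n; \theta) \xrightarrow{d} N(0, I(\theta)),$$

as $n \to \infty$.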
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7.4 SECOND-ORDER EFFICIENCY OF BAN ESTIMATORS 


Often BAN estimators $\hat\theta_n$ are biased, with

$$E_\theta\{\hat\theta_n\} = \theta + \frac{b}{n} + o\left(\frac{1}{n}\right), \quad \text{as } n \to \infty. \tag{7.4.1}$$

The problem then is how to compare two different BAN estimators of the same
parameter. Due to the bias, the asymptotic variance may not present their precision
correctly, when the sample size is not extremely large. Rao (1963) suggested first
adjusting an estimator $\hat\theta_n$ to reduce its bias to an order of magnitude of $1/n^2$. Let $\theta_n^*$ be the
adjusted estimator, and let

$$V_\theta\{\theta_n^*\} = \frac{1}{nI(\theta)} + \frac{D}{n^2} + o\left(\frac{1}{n^2}\right), \quad \text{as } n \to \infty. \tag{7.4.2}$$

The coefficient $D$ of $1/n^2$ is called the second-order deficiency coefficient. Among
two BAN estimators, we prefer the one having a smaller second-order deficiency
coefficient.

Efron (1975) analyzed the structure of the second-order coefficient $D$ in exponential
families in terms of their curvature, the Bhattacharyya second-order lower
bound, and the bias of the estimators. Akahira and Takeuchi (1981) and Pfanzagl
(1985) established the structure of the distributions of asymptotically high-order
most efficient estimators. They have shown that under the CR regularity conditions,
the distribution of the most efficient second-order estimator $\theta_n^*$ is


: _ 3i.2(0) + 243(8) 9 ca 
Plynl@) @; — 9) <1) = 0) + ro+o(—), 


(7.4.3) 
where

$$J_{1,2}(\theta) = E_\theta\left\{ S(X; \theta) \cdot \frac{\partial^2}{\partial\theta^2} \log f(X; \theta) \right\} \tag{7.4.4}$$

and

$$J_3(\theta) = E_\theta\{S^3(X; \theta)\}. \tag{7.4.5}$$

For additional reading, see also Barndorff-Nielsen and Cox (1994).


7.5 LARGE SAMPLE CONFIDENCE INTERVALS


Generally, the large sample approximations to confidence limits are based on the
MLEs of the parameter(s) under consideration. This approach is meaningful in cases
where the MLEs are known. Moreover, under the regularity conditions given in the
theorem of Section 7.3, the MLEs are BAN estimators. Accordingly, if the sample
size is large, one can in regular cases employ the BAN property of the MLE to construct
confidence intervals around the MLE. This is done by using the quantiles of the
standard normal distribution, and the square root of the inverse of the Fisher information
function as the standard deviation of the (asymptotic) sampling distribution. In many
situations, the inverse of the Fisher information function depends on the unknown
parameters. The practice is to substitute for the unknown parameters their respective
MLEs. If the samples are very large this approach may be satisfactory. However, as
will be shown later, if the samples are not very large it may be useful to apply first a
variance stabilizing transformation $g(\theta)$ and derive the confidence limits of $g(\theta)$.
A transformation $g(\theta)$ is called variance stabilizing if $g'(\theta) = \sqrt{I(\theta)}$. If $\hat\theta_n$ is
an MLE of $\theta$ then $g(\hat\theta_n)$ is an MLE of $g(\theta)$. The asymptotic variance of $g(\hat\theta_n)$
under the regularity conditions is $(g'(\theta))^2/nI(\theta)$. Accordingly, if $g'(\theta) = \sqrt{I(\theta)}$ then
the asymptotic variance of $g(\hat\theta_n)$ is $\frac{1}{n}$. For example, suppose that $X_1, \ldots, X_n$ is a
sample of $n$ i.i.d. binomial random variables, $B(1, \theta)$. Then, the MLE of $\theta$ is $\bar X_n$.
The Fisher information function is $I_n(\theta) = n/\theta(1 - \theta)$. If $g(\theta) = 2\sin^{-1}\sqrt{\theta}$ then
$g'(\theta) = 1/\sqrt{\theta(1 - \theta)}$. Hence, the asymptotic variance of $g(\bar X_n) = 2\sin^{-1}\sqrt{\bar X_n}$ is $\frac{1}{n}$.
Transformations stabilizing whole covariance matrices are discussed in the paper of
Holland (1973).

Let $\theta = t(g)$ be the inverse of a variance stabilizing transformation $g(\theta)$, and
suppose (without loss of generality) that $t(g)$ is strictly increasing. For cases satisfying
the BAN regularity conditions, if $\hat\theta_n$ is the MLE of $\theta$,

$$\sqrt{n}(g(\hat\theta_n) - g(\theta)) \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty. \tag{7.5.1}$$

A $(1 - \alpha)$ confidence interval for $g(\theta)$ is given asymptotically by
$(g(\hat\theta_n) - z_{1-\alpha/2}/\sqrt{n},\ g(\hat\theta_n) + z_{1-\alpha/2}/\sqrt{n})$, where $z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2)$.
Let $g_L$ and $g_U$ denote these lower and upper confidence limits. We assume that both
limits are within the range of the function $g(\theta)$; otherwise, we can always truncate
the interval in an appropriate manner. After obtaining the limits $g_L$ and $g_U$ we make
the inverse transformation on these limits and thus obtain the limits $\theta_L = t(g_L)$ and
$\theta_U = t(g_U)$. Indeed, since $t(g)$ is a one-to-one increasing transformation,

$$P_\theta\{\theta_L \leq \theta \leq \theta_U\} = P_\theta\{g_L \leq g(\theta) \leq g_U\}$$
$$= P_\theta\left\{ g(\theta) - \frac{z_{1-\alpha/2}}{\sqrt{n}} \leq g(\hat\theta_n) \leq g(\theta) + \frac{z_{1-\alpha/2}}{\sqrt{n}} \right\} \approx 1 - \alpha.$$

Thus, $(\theta_L, \theta_U)$ is an asymptotically $(1 - \alpha)$-confidence interval.
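For the binomial case the whole procedure fits in a few lines. The sketch below is a minimal illustration with a hypothetical function name: it uses $g(\theta) = 2\sin^{-1}\sqrt{\theta}$, whose range is $[0, \pi]$, truncates the limits for $g(\theta)$ to that range, and maps them back with the inverse $t(g) = \sin^2(g/2)$.

```python
import math
from statistics import NormalDist

def arcsine_interval(successes, n, alpha):
    """Asymptotic (1 - alpha) interval for a binomial theta via g = 2*asin(sqrt(theta))."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)       # z_{1 - alpha/2}
    g_hat = 2.0 * math.asin(math.sqrt(successes / n)) # g evaluated at the MLE
    half = z / math.sqrt(n)                           # asymptotic s.d. of g(MLE) is 1/sqrt(n)
    g_low = max(g_hat - half, 0.0)                    # truncate to the range of g
    g_up = min(g_hat + half, math.pi)
    return math.sin(g_low / 2.0) ** 2, math.sin(g_up / 2.0) ** 2
```

Unlike the interval built directly around $\bar X_n$, the width of the interval for $g(\theta)$ does not depend on the unknown $\theta$, and the back-transformed limits always stay inside $[0, 1]$.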
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7.6 EDGEWORTH AND SADDLEPOINT APPROXIMATIONS TO THE 
DISTRIBUTION OF THE MLE: ONE-PARAMETER CANONICAL 
EXPONENTIAL FAMILIES 


The asymptotically normal distributions for the MLE require often large samples to 
be effective. If the samples are not very large one could try to modify or correct the 
approximation by the Edgeworth expansion. We restrict attention in this section to 
the one-parameter exponential type families in canonical form. 

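According to (5.6.2), the MLE, $\hat\psi_n$, of the canonical parameter $\psi$ satisfies the
equation

$$K'(\hat\psi_n) = \frac{1}{n}\sum_{i=1}^{n} U(X_i) = \bar U_n. \tag{7.6.1}$$

The cumulant generating function $K(\psi)$ is analytic. Let $G(x)$ be the inverse function
of $K'(\psi)$. $G(x)$ is also analytic and one can write, for large samples,

$$\begin{aligned}
\hat\psi_n &= G(\bar U_n) \\
&= G(K'(\psi)) + (\bar U_n - K'(\psi))\,G'(K'(\psi)) + o_p\!\left(\frac{1}{\sqrt{n}}\right) \\
&= \psi + (\bar U_n - K'(\psi))/K''(\psi) + o_p\!\left(\frac{1}{\sqrt{n}}\right).
\end{aligned} \tag{7.6.2}$$

Recall that $K''(\psi) = I(\psi)$ is the Fisher information function, and for large samples,

$$\sqrt{n I(\psi)}\,(\hat\psi_n - \psi) = \sqrt{n}\,\frac{\bar U_n - K'(\psi)}{\sqrt{I(\psi)}} + o_p(1). \tag{7.6.3}$$

Moreover, $E\{\bar U_n\} = K'(\psi)$ and $V\{\sqrt{n}\,\bar U_n\} = I(\psi)$. Thus, by the CLT,

$$\sqrt{n}\,\frac{\bar U_n - K'(\psi)}{\sqrt{I(\psi)}} \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty. \tag{7.6.4}$$

Equivalently,

$$\sqrt{n}\,(\hat\psi_n - \psi) \xrightarrow{d} N\!\left(0, \frac{1}{I(\psi)}\right), \quad \text{as } n \to \infty. \tag{7.6.5}$$

This is a version of Theorem 7.3.1, in the present special case.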




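If the sample is not very large, we can add terms to the distribution of
$\sqrt{n I(\psi)}\,(\hat\psi_n - \psi)$ according to the Edgeworth expansion. We obtain

$$P\{\sqrt{n I(\psi)}\,(\hat\psi_n - \psi) \le x\} \cong \Phi(x) - \frac{\beta_1}{6\sqrt{n}}(x^2 - 1)\phi(x)
- \frac{x}{n}\left[\frac{\beta_2 - 3}{24}(x^2 - 3) + \frac{\beta_1^2}{72}(x^4 - 10x^2 + 15)\right]\phi(x), \tag{7.6.6}$$

where

$$\beta_1 = \frac{K^{(3)}(\psi)}{(K^{(2)}(\psi))^{3/2}}, \tag{7.6.7}$$

and

$$\beta_2 - 3 = \frac{K^{(4)}(\psi)}{(K^{(2)}(\psi))^{2}}. \tag{7.6.8}$$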
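As a rough numerical illustration, the Edgeworth correction can be coded directly. This is a sketch under the assumption that the expansion has the standard two-term form $\Phi(x) - \frac{\beta_1}{6\sqrt{n}}(x^2-1)\phi(x) - \frac{x}{n}\left[\frac{\beta_2-3}{24}(x^2-3) + \frac{\beta_1^2}{72}(x^4-10x^2+15)\right]\phi(x)$; the function name `edgeworth_cdf` and its argument names are illustrative, not taken from the text.

```python
import math
from statistics import NormalDist


def edgeworth_cdf(x, n, beta1, beta2_minus_3):
    """Two-term Edgeworth approximation to the c.d.f. of a standardized
    statistic, given the skewness and excess kurtosis coefficients."""
    std = NormalDist()
    phi = std.pdf(x)
    # Skewness correction of order n^(-1/2).
    skew_term = beta1 / (6 * math.sqrt(n)) * (x**2 - 1) * phi
    # Kurtosis and squared-skewness correction of order n^(-1).
    second_term = (x / n) * (
        beta2_minus_3 / 24 * (x**2 - 3)
        + beta1**2 / 72 * (x**4 - 10 * x**2 + 15)
    ) * phi
    return std.cdf(x) - skew_term - second_term


# Illustrative inputs: n = 20, beta1 = -1.1395, beta2 - 3 = 2.4.
approx = edgeworth_cdf(1.932, 20, -1.1395, 2.4)
print(round(approx, 3))
```

With both coefficients set to zero the function reduces to the plain normal approximation, which makes the size of the correction easy to inspect.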


Let $T_n = \sum_{i=1}^{n} U(X_i)$. $T_n$ is the likelihood statistic. As shown in Reid (1988) the
saddlepoint approximation to the p.d.f. of the MLE, $\hat\psi_n$, is

$$g_{\hat\psi_n}(x; \psi) = c_n (K^{(2)}(x))^{1/2} \exp\{-(x - \psi)T_n - n(K(\psi) - K(x))\}\,(1 + O(n^{-3/2})), \tag{7.6.9}$$

where $c_n$ is a factor of proportionality, such that $\int g_{\hat\psi_n}(x; \psi)\,d\mu(x) = 1$.


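Let $L(\theta; \mathbf{X}_n)$ and $l(\theta; \mathbf{X}_n)$ denote the likelihood and log-likelihood functions. Let
$\hat\theta_n$ denote the MLE of $\theta$, and

$$J_n(\hat\theta_n) = -\frac{1}{n}\cdot\frac{\partial^2}{\partial\theta^2}\, l(\theta; \mathbf{X}_n)\Big|_{\theta = \hat\theta_n}. \tag{7.6.10}$$

We have seen that $E_\theta\{J_n(\theta)\} = I(\theta)$. Thus, $J_n(\theta) \xrightarrow{\text{a.s.}} I(\theta)$, as $n \to \infty$ (the Fisher
information function). $J_n(\hat\theta_n)$ is an MLE estimator of $J_n(\theta)$. Thus, if $\hat\theta_n \xrightarrow{p} \theta$, as
$n \to \infty$, then, as in condition C.4 of Theorem 7.3.1, $J_n(\hat\theta_n) \xrightarrow{p} I(\theta)$, as $n \to \infty$.
The saddlepoint approximation to $g_{\hat\theta_n}(x; \theta)$ in the general regular case is

$$g_{\hat\theta_n}(x; \theta) = c_n (J_n(\hat\theta_n))^{1/2}\, \frac{L(\theta; \mathbf{X}_n)}{L(\hat\theta_n; \mathbf{X}_n)}. \tag{7.6.11}$$

Formula (7.6.11) is called the Barndorff-Nielsen $p^*$-formula. The order of magnitude
of its error, in large samples, is $O(n^{-3/2})$.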
7.7 LARGE SAMPLE TESTS


For testing two simple hypotheses there exists a most powerful test of size $\alpha$. We
have seen examples in which it is difficult to determine the exact critical level $k_\alpha$ of
the test. Such a case was demonstrated in Example 4.4. In that example, we
used the asymptotic distribution of the test statistic to approximate $k_\alpha$. Generally, if
$X_1, \ldots, X_n$ are i.i.d. with common p.d.f. $f(x; \theta)$ let

$$R(X) = \frac{f(X; \theta_1)}{f(X; \theta_0)}, \tag{7.7.1}$$

where the two simple hypotheses are $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$. The most powerful
test of size $\alpha$ can be written as

$$\phi(\mathbf{X}_n) = \begin{cases}
1, & \text{if } \sum_{i=1}^{n} \log R(X_i) > k_\alpha, \\
\gamma_\alpha, & \text{if } \sum_{i=1}^{n} \log R(X_i) = k_\alpha, \\
0, & \text{otherwise.}
\end{cases}$$

Thus, in large samples we can consider the test function $\phi(S_n) = I\{S_n \ge k_\alpha\}$, where
$S_n = \sum_{i=1}^{n} \log R(X_i)$. Note that under $H_0$, $E_{\theta_0}\{S_n\} = -n I(\theta_0, \theta_1)$ while under $H_1$,
$E_{\theta_1}\{S_n\} = n I(\theta_1, \theta_0)$, where $I(\theta, \theta')$ is the Kullback-Leibler information.
Let $\sigma_0^2 = V_{\theta_0}\{\log R(X_1)\}$. Assume that $0 < \sigma_0^2 < \infty$. Then, by the CLT,

$$\lim_{n\to\infty} P_{\theta_0}\left\{\frac{S_n + n I(\theta_0, \theta_1)}{\sqrt{n}\,\sigma_0} \le x\right\} = \Phi(x).$$

Hence, a large sample approximation to $k_\alpha$ is

$$k_\alpha = -n I(\theta_0, \theta_1) + z_{1-\alpha}\sqrt{n}\,\sigma_0. \tag{7.7.2}$$

The large sample approximation to the power of the test is

$$\psi(\theta_1, \sigma_1) = \Phi\left(\sqrt{n}\,\frac{I(\theta_0, \theta_1) + I(\theta_1, \theta_0)}{\sigma_1} - z_{1-\alpha}\frac{\sigma_0}{\sigma_1}\right), \tag{7.7.3}$$

where $\sigma_1^2 = V_{\theta_1}\{\log R(X_1)\}$. Generally, for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$,
where $\theta$ is a $k$-parameter vector, the following three test statistics are in common use,
in cases satisfying the CR regularity condition:


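1. The Wald Statistic

$$Q_W = n(\hat\theta_n - \theta_0)'\, J(\hat\theta_n)\,(\hat\theta_n - \theta_0), \tag{7.7.4}$$

where $\hat\theta_n$ is the MLE of $\theta$, and

$$J(\hat\theta_n) = -H(\theta)\big|_{\theta = \hat\theta_n}. \tag{7.7.5}$$

Here, $H(\theta)$ is the matrix of partial derivatives

$$H(\theta) = \frac{1}{n}\left(\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\sum_{l=1}^{n} \log f(X_l; \theta);\ i, j = 1, \ldots, k\right).$$

An alternative statistic, which is asymptotically equivalent to $Q_W$, is

$$Q_W^* = n(\hat\theta_n - \theta_0)'\, J(\theta_0)\,(\hat\theta_n - \theta_0). \tag{7.7.6}$$

One could also use the FIM, $I(\theta_0)$, instead of $J(\theta_0)$.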
2. The Wilks' Likelihood Ratio Statistic:

$$Q_L = 2\{l(\hat\theta_n; \mathbf{X}_n) - l(\theta_0; \mathbf{X}_n)\}. \tag{7.7.7}$$

3. Rao's Efficient Score Statistic:

$$Q_R = \frac{1}{n}\, S(\mathbf{X}_n; \theta_0)'\,(J(\theta_0))^{-1}\, S(\mathbf{X}_n; \theta_0), \tag{7.7.8}$$

where $S(\mathbf{X}_n; \theta)$ is the score function, namely, the gradient vector
$\nabla_\theta \sum_{i=1}^{n} \log f(X_i; \theta)$. $Q_R$ does not require the computation of the MLE $\hat\theta_n$.

On the basis of the multivariate asymptotic normality of $\hat\theta_n$, we can show that all
three test statistics have, in the regular cases and under $H_0$, an asymptotic $\chi^2[k]$
distribution. The asymptotic power function can be computed on the basis of the
non-central $\chi^2[k; \lambda]$ distribution.
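To make the three statistics concrete, here is a minimal sketch for the one-parameter Bernoulli model ($k = 1$). The function name `bernoulli_test_statistics` and the data are hypothetical; the observed information $J(\theta)$ is computed as minus the second derivative of the log-likelihood divided by $n$, and the MLE is assumed to lie strictly inside $(0, 1)$.

```python
import math


def bernoulli_test_statistics(successes, n, theta0):
    """Wald, Wilks (likelihood ratio) and Rao (score) statistics for
    H0: theta = theta0 in n Bernoulli trials."""
    theta_hat = successes / n  # MLE, assumed to be in (0, 1)

    def loglik(t):
        return successes * math.log(t) + (n - successes) * math.log(1 - t)

    def observed_info(t):
        # J(t) = -(1/n) * d^2 l / dt^2 for the Bernoulli log-likelihood.
        return theta_hat / t**2 + (1 - theta_hat) / (1 - t) ** 2

    # Score function (gradient of the log-likelihood) evaluated at theta0.
    score0 = successes / theta0 - (n - successes) / (1 - theta0)

    wald = n * (theta_hat - theta0) ** 2 * observed_info(theta_hat)
    wilks = 2 * (loglik(theta_hat) - loglik(theta0))
    rao = score0**2 / (n * observed_info(theta0))
    return wald, wilks, rao


# Hypothetical data: 56 successes in 100 trials, testing theta0 = 0.5.
wald, wilks, rao = bernoulli_test_statistics(56, 100, 0.5)
print(wald, wilks, rao)
```

The three values are close to one another, as the asymptotic equivalence suggests, and each would be compared with a quantile of the chi-squared distribution with one degree of freedom.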


7.8  PITMAN’S ASYMPTOTIC EFFICIENCY OF TESTS 


Pitman's asymptotic efficiency is an index of the relative performance of test
statistics in large samples. This index is called Pitman's asymptotic relative
efficiency (ARE). It was introduced by Pitman in 1948.

Let $X_1, \ldots, X_n$ be i.i.d. random variables, having a common distribution $F(x; \theta)$,
$\theta \in \Theta$. Let $T_n$ be a statistic. Suppose that there exist functions $\mu(\theta)$ and $\sigma_n(\theta)$
so that, for each $\theta \in \Theta$, $Z_n = (T_n - \mu(\theta))/\sigma_n(\theta) \xrightarrow{d} N(0, 1)$, as $n \to \infty$. Often
$\sigma_n(\theta) = c(\theta)w(n)$, where $w(n) = n^{-\alpha}$ for some $\alpha > 0$.
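Consider the problem of testing the hypotheses $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$,
at level $\alpha_n \to \alpha$, as $n \to \infty$. Let the sequence of test functions be

$$\phi_n(T_n) = I\{(T_n - \mu(\theta_0))/\sigma_n(\theta_0) \ge k_n\}, \tag{7.8.1}$$

where $k_n \to z_{1-\alpha}$. The corresponding power functions are

$$\psi_n(\theta; T_n) \approx \Phi\left(\frac{\mu(\theta) - \mu(\theta_0)}{w(n)c(\theta)} - z_{1-\alpha}\frac{c(\theta_0)}{c(\theta)}\right). \tag{7.8.2}$$

We assume that

1. $\mu(\theta)$ is continuously differentiable in the neighborhood of $\theta_0$, and $\mu'(\theta_0) > 0$;
2. $c(\theta)$ is continuous in the neighborhood of $\theta_0$, and $c(\theta_0) > 0$.

Under these assumptions, if $\theta_n = \theta_0 + \delta w(n)$ then, with $\delta > 0$,

$$\lim_{n\to\infty} \psi_n(\theta_n; T_n) = \Phi\left(\frac{\delta\mu'(\theta_0)}{c(\theta_0)} - z_{1-\alpha}\right) = \psi^*. \tag{7.8.3}$$

The function

$$J(\theta; T) = \frac{(\mu'(\theta))^2}{c^2(\theta)} \tag{7.8.4}$$

is called the asymptotic efficacy of $T_n$.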
Let $V_n$ be an alternative test statistic, and $W_n = (V_n - \eta(\theta))/(v(\theta)w(n)) \xrightarrow{d}
N(0, 1)$, as $n \to \infty$. The asymptotic efficacy of $V_n$ is $J(\theta; V) = (\eta'(\theta))^2/v^2(\theta)$.
Consider the case of $w(n) = n^{-1/2}$. Let $\theta_n = \theta_0 + \frac{\delta}{\sqrt{n}}$, $\delta > 0$, be a sequence of local
alternatives. Let $\psi_{n'}(\theta_n; V_{n'})$ be the sequence of power functions at $\theta_n = \theta_0 + \delta/\sqrt{n}$
and sample size $n'(n)$ so that $\lim_{n\to\infty} \psi_{n'}(\theta_n; V_{n'}) = \psi^* = \lim_{n\to\infty} \psi_n(\theta_n; T_n)$. For this

$$n'(n) = \frac{n J(\theta; T)}{J(\theta; V)} \tag{7.8.5}$$

and

$$\lim_{n\to\infty} \frac{n}{n'(n)} = \frac{J(\theta; V)}{J(\theta; T)} = \left(\frac{\eta'(\theta_0)}{\mu'(\theta_0)}\right)^2 \frac{c^2(\theta_0)}{v^2(\theta_0)}. \tag{7.8.6}$$

This limit (7.8.6) is the Pitman ARE of $V_n$ relative to $T_n$.

We remark that the asymptotic distributions of $Z_n$ and $W_n$ do not have to be $N(0, 1)$,
but they should be the same. If $Z_n$ and $W_n$ converge to two different distributions, the
Pitman ARE is not defined.
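As a small numerical sketch of the efficacy ratio, consider the sample median against the sample mean for $N(\theta, 1)$ data. The helper names `efficacy`, `pitman_are`, and `matched_sample_size` are illustrative assumptions; the inputs used are that the mean has scale function equal to the standard deviation 1 and the median has scale function $1/(2f(0))$, both with a centering function whose derivative is 1.

```python
import math


def efficacy(center_derivative, scale):
    """Asymptotic efficacy: squared derivative of the centering function
    divided by the squared scale function."""
    return (center_derivative / scale) ** 2


def pitman_are(efficacy_v, efficacy_t):
    """Pitman ARE of V relative to T, as the ratio of efficacies."""
    return efficacy_v / efficacy_t


def matched_sample_size(n, efficacy_v, efficacy_t):
    """Sample size n' that V needs to match the local power of T at size n."""
    return n * efficacy_t / efficacy_v


f0 = 1 / math.sqrt(2 * math.pi)           # standard normal density at 0
j_mean = efficacy(1.0, 1.0)               # sample mean, sigma = 1
j_median = efficacy(1.0, 1 / (2 * f0))    # sample median
print(pitman_are(j_median, j_mean))       # about 0.637
print(matched_sample_size(100, j_median, j_mean))
```

An ARE below 1 means the alternative statistic needs proportionally more observations to reach the same limiting power against local alternatives.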
7.9 ASYMPTOTIC PROPERTIES OF SAMPLE QUANTILES


Given a random sample of $n$ i.i.d. random variables, the empirical distribution of the
sample is

$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I\{X_i \le x\}, \quad -\infty < x < \infty. \tag{7.9.1}$$

This is a step function, with jumps of size $1/n$ at the location of the sample random
variables $\{X_i, i = 1, \ldots, n\}$. The $p$th quantile of a distribution $F$ is defined as

$$\xi_p = F^{-1}(p) = \inf\{x : F(x) \ge p\}. \tag{7.9.2}$$

According to this definition the quantiles are unique. Similarly, the $p$th sample quantiles
are defined as $\xi_{n,p} = F_n^{-1}(p)$.
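The two definitions translate directly into code. The sketch below is illustrative only; the names `empirical_cdf` and `sample_quantile` are assumptions, and the quantile is computed as the smallest order statistic whose empirical c.d.f. value is at least $p$, which is the infimum in the definition of the quantile.

```python
def empirical_cdf(sample, x):
    """F_n(x): the proportion of sample values not exceeding x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def sample_quantile(sample, p):
    """inf{x : F_n(x) >= p} for 0 < p <= 1."""
    ordered = sorted(sample)
    n = len(ordered)
    for k, value in enumerate(ordered, start=1):
        # F_n jumps to k/n at the k-th order statistic (no ties assumed).
        if k / n >= p:
            return value
    return ordered[-1]


data = [3, 1, 2, 5, 4]  # hypothetical sample
print(empirical_cdf(data, 3))     # 0.6
print(sample_quantile(data, 0.5)) # 3
```

Since the empirical c.d.f. is a right-continuous step function, the infimum is always attained at one of the order statistics.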


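Theorem 7.9.1. Let $0 < p < 1$. Suppose that $F$ is differentiable at the $p$th quantile
$\xi_p$, and $F'(\xi_p) > 0$, then $\xi_{n,p} \to \xi_p$ a.s. as $n \to \infty$.


Proof. Let $\epsilon > 0$, then

$$F(\xi_p - \epsilon) < p < F(\xi_p + \epsilon).$$

By the SLLN

$$F_n(\xi_p - \epsilon) \to F(\xi_p - \epsilon) \ \text{a.s., as } n \to \infty$$

and

$$F_n(\xi_p + \epsilon) \to F(\xi_p + \epsilon) \ \text{a.s., as } n \to \infty.$$

Hence,

$$P\{F_m(\xi_p - \epsilon) < p < F_m(\xi_p + \epsilon),\ \forall m \ge n\} \to 1$$

as $n \to \infty$. Thus,

$$P\{\xi_p - \epsilon < F_m^{-1}(p) \le \xi_p + \epsilon,\ \forall m \ge n\} \to 1$$

as $n \to \infty$. That is,

$$P\left\{\sup_{m \ge n} |\xi_{m,p} - \xi_p| > \epsilon\right\} \to 0, \quad \text{as } n \to \infty.$$

QED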
Note that if $0 < F(\xi) < 1$ then, by the CLT,

$$\lim_{n\to\infty} P\left\{\frac{\sqrt{n}\,(F_n(\xi) - F(\xi))}{\sqrt{F(\xi)(1 - F(\xi))}} \le t\right\} = \Phi(t), \tag{7.9.3}$$

for all $-\infty < t < \infty$. We show now that, under certain conditions, $\xi_{n,p}$ is asymptotically
normal.


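Theorem 7.9.2. Let $0 < p < 1$. Suppose that $F$ is continuous at $\xi_p = F^{-1}(p)$.
Then,

(i) If $F'(\xi_p-) > 0$ then, for all $t < 0$,

$$\lim_{n\to\infty} P\left\{\frac{n^{1/2}(\xi_{n,p} - \xi_p)}{\sqrt{p(1-p)}/F'(\xi_p-)} \le t\right\} = \Phi(t). \tag{7.9.4}$$

(ii) If $F'(\xi_p+) > 0$ then, for all $t > 0$,

$$\lim_{n\to\infty} P\left\{\frac{n^{1/2}(\xi_{n,p} - \xi_p)}{\sqrt{p(1-p)}/F'(\xi_p+)} \le t\right\} = \Phi(t). \tag{7.9.5}$$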

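Proof. Fix $t$. Let $A > 0$ and define

$$G_n(t) = P\left\{\frac{n^{1/2}(\xi_{n,p} - \xi_p)}{A} \le t\right\}. \tag{7.9.6}$$

Thus,

$$G_n(t) = P\{\xi_{n,p} \le \xi_p + tAn^{-1/2}\} = P\{p \le F_n(\xi_p + tAn^{-1/2})\}. \tag{7.9.7}$$

Moreover, since $nF_n(\xi) \sim B(n, F(\xi))$,

$$G_n(t) = P\left\{p \le \frac{1}{n} B(n, F(\xi_p + tAn^{-1/2}))\right\}. \tag{7.9.8}$$

By the CLT,

$$P\left\{\frac{B(n, F(\xi)) - nF(\xi)}{\sqrt{nF(\xi)\bar F(\xi)}} \le z\right\} \to \Phi(z), \tag{7.9.9}$$

as $n \to \infty$, where $\bar F(\xi) = 1 - F(\xi)$. Let

$$\Delta_n(t) = F(\xi_p + tAn^{-1/2}), \tag{7.9.10}$$

and

$$Z_n^*(\Delta) = \frac{B(n, \Delta) - n\Delta}{\sqrt{n\Delta(1 - \Delta)}}. \tag{7.9.11}$$

Then

$$G_n(t) = P\{Z_n^*(\Delta_n(t)) \ge -C_n(t)\}, \tag{7.9.12}$$

where

$$C_n(t) = \frac{n^{1/2}(\Delta_n(t) - p)}{\sqrt{\Delta_n(t)(1 - \Delta_n(t))}}. \tag{7.9.13}$$

Since $F$ is continuous at $\xi_p$,

$$\Delta_n(t)(1 - \Delta_n(t)) \to p(1 - p), \quad \text{as } n \to \infty.$$

Moreover, if $t > 0$, $F(\xi_p + tAn^{-1/2}) - F(\xi_p) = \frac{tA}{\sqrt{n}} F'(\xi_p+) + o\left(\frac{1}{\sqrt{n}}\right)$. Hence, if
$t > 0$,

$$C_n(t) \to \frac{tA}{\sqrt{p(1-p)}}\, F'(\xi_p+), \quad \text{as } n \to \infty. \tag{7.9.14}$$

Similarly, if $t < 0$,

$$C_n(t) \to \frac{tA}{\sqrt{p(1-p)}}\, F'(\xi_p-), \quad \text{as } n \to \infty. \tag{7.9.15}$$

Thus, let

$$A = \begin{cases}
\dfrac{\sqrt{p(1-p)}}{F'(\xi_p-)}, & \text{if } t < 0, \\[2ex]
\dfrac{\sqrt{p(1-p)}}{F'(\xi_p+)}, & \text{if } t > 0.
\end{cases}$$

Then, $\lim_{n\to\infty} C_n(t) = t$. Hence, from (7.9.12),

$$\lim_{n\to\infty} G_n(t) = \Phi(t).$$

QED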
Corollary. If $F$ is differentiable at $\xi_p$, and $f(\xi_p) = \frac{d}{dx}F(x)\big|_{x = \xi_p} > 0$, then $\xi_{n,p}$ is
asymptotically $N\left(\xi_p, \dfrac{p(1-p)}{nf^2(\xi_p)}\right)$.
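The asymptotic variance in the Corollary, $p(1-p)/(nf^2(\xi_p))$, is easy to evaluate numerically. The sketch below does so for quantiles of the standard normal distribution; the function name `quantile_asymptotic_variance` is an illustrative assumption, and the density and quantile values come from Python's standard library.

```python
import math
from statistics import NormalDist


def quantile_asymptotic_variance(p, n, density_at_quantile):
    """Asymptotic variance p(1 - p) / (n f^2(xi_p)) of the p-th sample quantile."""
    return p * (1 - p) / (n * density_at_quantile**2)


std = NormalDist()
n = 100

# Median of N(0, 1): xi_p = 0 and f(0) = 1 / sqrt(2 * pi).
av_median = quantile_asymptotic_variance(0.5, n, std.pdf(std.inv_cdf(0.5)))
print(av_median, math.pi / (2 * n))  # the two values agree

# Upper decile: the density is smaller there, so the variance is larger.
av_decile = quantile_asymptotic_variance(0.9, n, std.pdf(std.inv_cdf(0.9)))
print(av_decile)
```

For the normal median the variance is $\pi/(2n)$, compared with $1/n$ for the sample mean, and the variance grows for quantiles in regions where the density is low.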
PART II: EXAMPLES 
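Example 7.1. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables, such that
$E\{|X_1|\} < \infty$. By the SLLN, $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mu$, as $n \to \infty$, where $\mu = E\{X_1\}$.
Thus, the sample mean $\bar X_n$ is a strongly consistent estimator of $\mu$. Similarly, if
$E\{|X_1|^r\} < \infty$, $r \ge 1$, then the $r$th sample moment $M_{n,r}$ is a strongly consistent
estimator of $\mu_r = E\{X_1^r\}$, i.e.,

$$M_{n,r} = \frac{1}{n}\sum_{i=1}^{n} X_i^r \xrightarrow{\text{a.s.}} \mu_r, \quad \text{as } n \to \infty.$$

Thus, if $\sigma^2 = V\{X_1\}$, and $0 < \sigma^2 < \infty$,

$$\hat\sigma_n^2 = M_{n,2} - (M_{n,1})^2 \xrightarrow{\text{a.s.}} \sigma^2, \quad \text{as } n \to \infty.$$

That is, $\hat\sigma_n^2$ is a strongly consistent estimator of $\sigma^2$. It follows that
$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X_n)^2$ is also a strongly consistent estimator of $\sigma^2$. Note that, since
$M_{n,r} \xrightarrow{\text{a.s.}} \mu_r$, as $n \to \infty$, whenever $E\{|X_1|^r\} < \infty$, then for any continuous function
$g(\cdot)$, $g(M_{n,r}) \xrightarrow{\text{a.s.}} g(\mu_r)$, as $n \to \infty$. Thus, if

$$\beta_1 = \frac{\mu_3^*}{(\mu_2^*)^{3/2}},$$

where $\mu_2^*$ and $\mu_3^*$ denote the second and third central moments, is the coefficient of
skewness, the sample coefficient of skewness is a strongly consistent estimator of $\beta_1$, i.e.,

$$\beta_{1,n} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X_n)^3}{(\hat\sigma_n)^3} \xrightarrow{\text{a.s.}} \beta_1, \quad \text{as } n \to \infty.$$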


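Example 7.2. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables having a
rectangular distribution $R(0, \theta)$, $0 < \theta < \infty$. Since $\mu_1 = \theta/2$, $\hat\theta_{1,n} = 2\bar X_n$ is a strongly
consistent estimator of $\theta$. The MLE $\hat\theta_{2,n} = X_{(n)}$ is also a strongly consistent estimator
of $\theta$. Actually, for any $0 < \epsilon < \theta$,

$$P\{\hat\theta_{2,n} \le \theta - \epsilon\} = \left(1 - \frac{\epsilon}{\theta}\right)^n, \quad n \ge 1.$$

Hence, by the Borel-Cantelli Lemma, $P\{\hat\theta_{2,n} \le \theta - \epsilon, \text{ i.o.}\} = 0$. This implies that
$\hat\theta_{2,n} \xrightarrow{\text{a.s.}} \theta$, as $n \to \infty$. The MLE is strongly consistent. The expected value
of the MLE is $E\{\hat\theta_{2,n}\} = \frac{n}{n+1}\theta$. The variance of the MLE is

$$V\{\hat\theta_{2,n}\} = \frac{n\theta^2}{(n+1)^2(n+2)}.$$

The MSE of $\hat\theta_{2,n}$ is $V\{\hat\theta_{2,n}\} + \text{Bias}^2\{\hat\theta_{2,n}\}$, i.e.,

$$\text{MSE}\{\hat\theta_{2,n}\} = \frac{2\theta^2}{(n+1)(n+2)}.$$

The variance of $\hat\theta_{1,n}$ is $V\{\hat\theta_{1,n}\} = \frac{\theta^2}{3n}$. The relative efficiency of $\hat\theta_{1,n}$ against $\hat\theta_{2,n}$ is

$$\text{Rel. eff.} = \frac{\text{MSE}\{\hat\theta_{2,n}\}}{V\{\hat\theta_{1,n}\}} = \frac{6n}{(n+1)(n+2)} \to 0,$$

as $n \to \infty$. Thus, in large samples, $2\bar X_n$ is a very inefficient estimator relative to the
MLE.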
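A few lines of code show how quickly the relative efficiency decays. This is a sketch using the closed forms $2\theta^2/((n+1)(n+2))$ for the MSE of the MLE and $\theta^2/(3n)$ for the variance of $2\bar X_n$; the function names are illustrative assumptions.

```python
def mse_mle(theta, n):
    """MSE of the sample maximum as an estimator of theta in R(0, theta)."""
    return 2 * theta**2 / ((n + 1) * (n + 2))


def var_moment_estimator(theta, n):
    """Variance of twice the sample mean in R(0, theta)."""
    return theta**2 / (3 * n)


def relative_efficiency(n):
    """Ratio MSE(MLE) / V(2 * mean); it does not depend on theta."""
    return 6 * n / ((n + 1) * (n + 2))


for n in (1, 10, 100, 1000):
    print(n, relative_efficiency(n))
```

The ratio equals 1 at $n = 1$, where both estimators are functions of the single observation with equal mean squared error, and it decreases at the rate $6/n$ afterwards.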


Example 7.3. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a continuous
distribution $F(x)$, symmetric around a point $\theta$. $\theta$ is obviously the median of the
distribution, i.e., $\theta = F^{-1}\left(\frac{1}{2}\right)$. We index these distributions by $\theta$ and consider the
location family $\mathcal{F}_s = \{F_\theta : F_\theta(x) = F(x - \theta), \text{ and } F(-z) = 1 - F(z);\ -\infty < \theta <
\infty\}$. The functional form of $F$ is not specified in this model. Thus, $\mathcal{F}_s$ is the family
of all symmetric, continuous distributions. We wish to test the hypotheses

$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta > \theta_0.$$

The following test is the Wilcoxon signed-rank test:
Let $Y_i = X_i - \theta_0$, $i = 1, \ldots, n$. Let $S(Y_i) = I\{Y_i > 0\}$, $i = 1, \ldots, n$. We consider
now the ordered absolute values of $Y_i$, i.e.,

$$|Y|_{(1)} < |Y|_{(2)} < \cdots < |Y|_{(n)},$$

and let $R(Y_i)$ be the index $(j)$ denoting the place of $Y_i$ in the ordered absolute values,
i.e., $R(Y_i) = j$, $j = 1, \ldots, n$ if, and only if, $|Y_i| = |Y|_{(j)}$. Define the test statistic

$$T_n = \sum_{i=1}^{n} S(Y_i)R(Y_i).$$
i=l 


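The test of $H_0$ versus $H_1$ based on $T_n$, which rejects $H_0$ if $T_n$ is sufficiently large, is
called the Wilcoxon signed-rank test. We show that this test is consistent. Note that
under $H_0$, $P_0\{S(Y_i) = 1\} = \frac{1}{2}$. Moreover, for each $i = 1, \ldots, n$, under $H_0$

$$\begin{aligned}
P_0\{S(Y_i) = 1, |Y_i| \le y\} &= P\{0 \le Y_i \le y\} \\
&= F(y) - \frac{1}{2} \\
&= \frac{1}{2}(2F(y) - 1) \\
&= P_0\{S(Y_i) = 1\}\,P_0\{|Y_i| \le y\}.
\end{aligned}$$

Thus, $S(Y_i)$ and $|Y_i|$ are independent. This implies that, under $H_0$, $S(Y_1), \ldots, S(Y_n)$
are independent of $R(Y_1), \ldots, R(Y_n)$, and the distribution of $T_n$, under $H_0$, is like that
of $\sum_{j=1}^{n} jW_j$, where $W_1, \ldots, W_n$ are i.i.d. $B\left(1, \frac{1}{2}\right)$. It follows that, under $H_0$,

$$\mu_0 = E_0\{T_n\} = \frac{1}{2}\sum_{j=1}^{n} j = \frac{n(n+1)}{4}.$$

Similarly, under $H_0$,

$$V_0\{T_n\} = \frac{1}{4}\sum_{j=1}^{n} j^2 = \frac{n(n+1)(2n+1)}{24}.$$

According to Problem 3 of Section 1.12, the CLT holds, and

$$P_0\left\{\frac{T_n - \mu_0}{\sqrt{V_0\{T_n\}}} \le x\right\} \to \Phi(x).$$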
Thus, the test function

$$\phi(T_n) = \begin{cases}
1, & \text{if } T_n \ge \dfrac{n(n+1)}{4} + z_{1-\alpha}\sqrt{\dfrac{n(n+1)(2n+1)}{24}}, \\[1ex]
0, & \text{otherwise}
\end{cases}$$

has, asymptotically, size $\alpha$, $0 < \alpha < 1$. This establishes part (i) of the definition of
consistency.

When $\theta > \theta_0$ (under $H_1$) the distribution of $T_n$ is more complicated. We can
consider the test statistic $V_n = \dfrac{T_n}{n(n+1)}$. One can show (see Hettmansperger, 1984,
p. 47) that the asymptotic mean of $V_n$, as $n \to \infty$, is $\frac{1}{2}p_2(\theta)$, and the asymptotic
variance of $V_n$ is

$$AV = \frac{1}{n}\left(p_4(\theta) - p_2^2(\theta)\right),$$

where

$$p_2(\theta) = P_\theta\{Y_1 + Y_2 > 0\},$$
$$p_4(\theta) = P_\theta\{Y_1 + Y_2 > 0,\ Y_1 + Y_3 > 0\}.$$

In addition, one can show that the asymptotic distribution of $V_n$ (under $H_1$) is normal
(see Hettmansperger, 1984).

Finally, when $\theta > \theta_0$, $p_2(\theta) > \frac{1}{2}$ and

$$\begin{aligned}
\lim_{n\to\infty} P_\theta\left\{T_n > \frac{n(n+1)}{4} + z_{1-\alpha}\sqrt{\frac{n(n+1)(2n+1)}{24}}\right\}
&= \lim_{n\to\infty} P_\theta\left\{V_n > \frac{1}{4} + z_{1-\alpha}\sqrt{\frac{2n+1}{24n(n+1)}}\right\} \\
&= 1 - \lim_{n\to\infty} \Phi\left(\frac{\sqrt{n}\left(\frac{1}{4} - \frac{1}{2}p_2(\theta)\right)}{\sqrt{p_4(\theta) - p_2^2(\theta)}}\right) = 1,
\end{aligned}$$

for all $\theta > \theta_0$. Thus, the Wilcoxon signed-rank test is consistent.
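Example 7.4. Let $T_1, T_2, \ldots, T_n$ be i.i.d. random variables having an exponential
distribution with mean $\beta$, $0 < \beta < \infty$. The observable random variables are $X_i =
\min(T_i, t^*)$, $i = 1, \ldots, n$; $0 < t^* < \infty$. This is the case of Type I censoring of the
random variables $T_1, \ldots, T_n$.

The likelihood function of $\beta$, $0 < \beta < \infty$, is

$$L(\beta; \mathbf{X}_n) = \frac{1}{\beta^{K_n}} \exp\left\{-\frac{1}{\beta}\sum_{i=1}^{n} X_i\right\},$$

where $K_n = \sum_{i=1}^{n} I\{X_i < t^*\}$. Note that the MLE of $\beta$ does not exist if $K_n = 0$.
However, $P\{K_n = 0\} = e^{-nt^*/\beta} \to 0$ as $n \to \infty$. Thus, for sufficiently large $n$, the
MLE of $\beta$ is

$$\hat\beta_n = \frac{\sum_{i=1}^{n} X_i}{K_n}.$$

Note that by the SLLN,

$$\frac{1}{n}K_n \xrightarrow{\text{a.s.}} 1 - e^{-t^*/\beta}, \quad \text{as } n \to \infty,$$

and

$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} E\{X_1\}, \quad \text{as } n \to \infty.$$

Moreover,

$$\begin{aligned}
E\{X_1\} &= \frac{1}{\beta}\int_0^{t^*} x e^{-x/\beta}\,dx + t^* e^{-t^*/\beta} \\
&= \beta\left(1 - P\left(1; \frac{t^*}{\beta}\right)\right) + t^* e^{-t^*/\beta} \\
&= \beta(1 - e^{-t^*/\beta}),
\end{aligned}$$

where $P(1; \lambda)$ denotes the Poisson c.d.f. at 1 with mean $\lambda$.
Thus, $\hat\beta_n \xrightarrow{\text{a.s.}} \beta$, as $n \to \infty$. This establishes the strong consistency of $\hat\beta_n$.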
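The estimator and the limit argument can be sketched in a few lines. The function names and the small data set below are hypothetical; the estimator is the total observed time divided by the number of uncensored observations, and the helper functions encode the two almost-sure limits whose ratio is $\beta$.

```python
import math


def censored_exponential_mle(observations, t_star):
    """MLE of the mean under Type I censoring at t_star.
    Returns None when every observation is censored (the MLE does not exist)."""
    uncensored = sum(1 for x in observations if x < t_star)
    if uncensored == 0:
        return None
    return sum(observations) / uncensored


def expected_observation(beta, t_star):
    """E{min(T, t_star)} for T exponential with mean beta."""
    return beta * (1 - math.exp(-t_star / beta))


def uncensored_probability(beta, t_star):
    """P{T < t_star} for T exponential with mean beta."""
    return 1 - math.exp(-t_star / beta)


# Hypothetical sample censored at t_star = 2.0 (two censored values).
print(censored_exponential_mle([0.5, 1.2, 2.0, 2.0, 0.3], 2.0))

# The ratio of the two limits recovers beta, which is the consistency argument.
print(expected_observation(3.0, 2.0) / uncensored_probability(3.0, 2.0))
```

The second printed value equals the mean used to compute it, mirroring the fact that the ratio of the two sample averages converges to $\beta$.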


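Example 7.5. Let $\{X_n\}$ be a sequence of i.i.d. random variables, $X_1 \sim N(\theta, 1)$,
$-\infty < \theta < \infty$. Given a sample of $n$ observations, the minimal sufficient statistic
is $\bar X_n = \frac{1}{n}\sum_{j=1}^{n} X_j$. The Fisher information function is $I(\theta) = 1$, and $\bar X_n$ is a BAN
estimator. Consider the estimator

$$\hat\theta_n = \begin{cases}
\frac{1}{2}\bar X_n, & \text{if } |\bar X_n| \le \dfrac{\log n}{\sqrt{n}}, \\[1ex]
\bar X_n, & \text{otherwise.}
\end{cases}$$

Let

$$A_n = \left\{\mathbf{X}_n : |\bar X_n| \le \frac{\log n}{\sqrt{n}}\right\}.$$

Now,

$$P_\theta\{A_n\} = \begin{cases}
2\Phi(\log n) - 1, & \text{if } \theta = 0, \\
\Phi(\log n - \sqrt{n}\,\theta) + \Phi(\log n + \sqrt{n}\,\theta) - 1, & \text{if } \theta \ne 0.
\end{cases}$$

Thus,

$$\lim_{n\to\infty} P_\theta\{A_n\} = \begin{cases}
1, & \text{if } \theta = 0, \\
0, & \text{if } \theta \ne 0.
\end{cases}$$

We show now that $\hat\theta_n$ is consistent. Indeed, for any $\delta > 0$,

$$P_\theta\{|\hat\theta_n - \theta| > \delta\} = P_\theta\{|\hat\theta_n - \theta| > \delta,\ I_{A_n}(\mathbf{X}_n) = 1\} + P_\theta\{|\hat\theta_n - \theta| > \delta,\ I_{A_n}(\mathbf{X}_n) = 0\}.$$

If $\theta = 0$ then

$$\begin{aligned}
P_\theta\{|\hat\theta_n| > \delta\} &= P_\theta\{|\bar X_n| > 2\delta,\ I_{A_n}(\mathbf{X}_n) = 1\} + P_\theta\{|\bar X_n| > \delta,\ I_{A_n}(\mathbf{X}_n) = 0\} \\
&\le P_\theta\{|\bar X_n| > 2\delta\} + P_\theta\{I_{A_n}(\mathbf{X}_n) = 0\} \to 0,
\end{aligned}$$

as $n \to \infty$, since $\bar X_n$ is consistent. Similarly, if $\theta \ne 0$,

$$P_\theta\{|\hat\theta_n - \theta| > \delta\} \le P_\theta\{I_{A_n}(\mathbf{X}_n) = 1\} + P_\theta\{|\bar X_n - \theta| > \delta\} \to 0,$$

as $n \to \infty$. Thus, $\hat\theta_n$ is consistent. Furthermore,

$$\begin{aligned}
\hat\theta_n &= \frac{1}{2}\bar X_n I_{A_n}(\mathbf{X}_n) + \bar X_n(1 - I_{A_n}(\mathbf{X}_n)) \\
&= \bar X_n - \frac{1}{2}\bar X_n I_{A_n}(\mathbf{X}_n).
\end{aligned}$$

Hence,

$$\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} \begin{cases}
N\left(0, \frac{1}{4}\right), & \text{if } \theta = 0, \\
N(0, 1), & \text{if } \theta \ne 0,
\end{cases}$$

as $n \to \infty$. This shows that $\hat\theta_n$ is asymptotically normal, with asymptotic variance

$$AV_\theta\{\hat\theta_n\} = \begin{cases}
\dfrac{1}{4n}, & \text{if } \theta = 0, \\[1ex]
\dfrac{1}{n}, & \text{otherwise.}
\end{cases}$$

$\hat\theta_n$ is super efficient.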


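Example 7.6. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables, $X_1 \sim B(1, e^{-\theta})$, $0 <
\theta < \infty$. The MLE of $\theta$ after $n$ observations is

$$\hat\theta_n = -\log\left(\frac{\sum_{i=1}^{n} X_i}{n}\right).$$

$\hat\theta_n$ does not exist if $\sum_{i=1}^{n} X_i = 0$. The probability of this event is $(1 - e^{-\theta})^n$. Thus, if
$n \ge N(\delta, \theta) = \dfrac{\log\delta}{\log(1 - e^{-\theta})}$, then $P_\theta\left\{\sum_{i=1}^{n} X_i = 0\right\} \le \delta$. For $n \ge N(\delta, \theta)$, let

$$B_n = \left\{\mathbf{X}_n : \sum_{i=1}^{n} X_i \ge 1\right\},$$

then $P_\theta\{B_n\} \ge 1 - \delta$. On the set $B_n$, $-\dfrac{1}{n}\dfrac{\partial}{\partial\theta}S(\mathbf{X}_n; \hat\theta_n) = \dfrac{\hat p_n}{1 - \hat p_n}$, where $\hat p_n =
\frac{1}{n}\sum_{i=1}^{n} X_i$ is the MLE of $p = e^{-\theta}$. Finally, the Fisher information function is
$I(\theta) = e^{-\theta}/(1 - e^{-\theta})$, and

$$-\frac{1}{n}\frac{\partial}{\partial\theta}S(\mathbf{X}_n; \hat\theta_n) \xrightarrow{\text{a.s.}} I(\theta).$$

All the conditions of Theorem 7.3.1 hold, and

$$\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} N\left(0, \frac{1 - e^{-\theta}}{e^{-\theta}}\right), \quad \text{as } n \to \infty.$$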
Example 7.7. Consider again the MLEs of the parameters of a Weibull distribution
$G^{1/\beta}(\lambda, 1)$; $0 < \beta, \lambda < \infty$, which have been developed in Example 5.19. The
likelihood function $L(\lambda, \beta; \mathbf{X}_n)$ is specified there. We derive here the asymptotic
covariance matrix of the MLEs $\hat\lambda$ and $\hat\beta$. Note that the Weibull distributions satisfy
all the required regularity conditions.

Let $I_{ij}$, $i = 1, 2$, $j = 1, 2$ denote the elements of the Fisher information matrix.
These elements are defined as

$$I_{11} = E\left\{\left[\frac{\partial}{\partial\lambda}\log L(\lambda, \beta; X)\right]^2\right\},$$

$$I_{12} = E\left\{\left[\frac{\partial}{\partial\lambda}\log L(\lambda, \beta; X)\right]\left[\frac{\partial}{\partial\beta}\log L(\lambda, \beta; X)\right]\right\},$$

$$I_{22} = E\left\{\left[\frac{\partial}{\partial\beta}\log L(\lambda, \beta; X)\right]^2\right\}.$$

We will derive the formulae for these elements under the assumption of $n = 1$
observation. The resulting information matrix can then be multiplied by $n$ to yield
that of a random sample of size $n$. This is due to the fact that the random variables
are i.i.d.

The partial derivatives of the log-likelihood are

$$\frac{\partial}{\partial\lambda}\log L(\lambda, \beta; X) = \frac{1}{\lambda} - X^\beta,$$

$$\frac{\partial}{\partial\beta}\log L(\lambda, \beta; X) = \frac{1}{\beta} + \log X - \lambda X^\beta \log X.$$

Thus,

$$I_{11} = E\left\{\left(\frac{1}{\lambda} - X^\beta\right)^2\right\} = \frac{1}{\lambda^2},$$

since $X^\beta \sim E(\lambda)$. It is much more complicated to derive the other elements of $I(\theta)$.
For this purpose, we introduce first a few auxiliary results. Let M(t) be the moment 
generating function of the extreme-value distribution. We note that 


oe) 
M'(t) = / gee de be ly 


(oe) 


CO 
M"(t) = / a amas | ae 
CO 
Accordingly,


d CO 
—Trd+n= ‘ (log x)x'e*dx 
dt 0 


[o.e) 
7 i ze" dz =M(-t), t>-l, 
—0o 
similarly, 
a 
—T(1+t)=M(-t), t>-l. 
prt) = M1) 


These identities are used in the following derivations: 


1 1 
In=E greek Ae kee poe 


or ; 
= ge (3) — 21’(2) + y — logA], 


where $\gamma = 0.577216\ldots$ is the Euler constant. Moreover, as compiled from the tables
of Abramowitz and Stegun (1968, p. 253)


$\Gamma'(2) = 0.42278\ldots$ and $\Gamma'(3) = 1.84557\ldots$


We also obtain 


1 me 2 " " 
n= 3 1+ = +7 —logay +1") — 20") 


— 2log a(I'"(3) — 21"’(2) D+= toe), 


where $\Gamma''(2) = 0.82367$ and $\Gamma''(3) = 2.49293$. The derivations of formulae for $I_{12}$
and $I_{22}$ are lengthy and tedious. We provide here, for example, the derivation of one
expectation:


.E{X* (log X)?} = BE UXM log KP, 
However, $X^\beta \sim E(\lambda) \sim \frac{1}{\lambda}U$, where $U \sim E(1)$. Therefore,


AE{X* (log X)2} = 2E {u( log )| /B° 


2 °° E 
= vel. ne ‘dz —2iog2 f ie dz + ogr?| 


—0o 


ea 5[P"(2) — 2dog A)T '(2) + (log A)*]. 


The reader can derive other expressions similarly. 

For each value of $\lambda$ and $\beta$, we evaluate $I_{11}$, $I_{12}$, and $I_{22}$. The asymptotic variances
and covariances of the MLEs, designated by AV and AC, are determined from the
inverse of the Fisher information matrix by

$$AV\{\hat\lambda\} = \frac{I_{22}}{n[I_{11}I_{22} - I_{12}^2]},$$

$$AV\{\hat\beta\} = \frac{I_{11}}{n[I_{11}I_{22} - I_{12}^2]},$$

and

$$AC(\hat\lambda, \hat\beta) = \frac{-I_{12}}{n[I_{11}I_{22} - I_{12}^2]}.$$

Applying these formulae to determine the asymptotic variances and asymptotic
covariance of $\hat\lambda$ and $\hat\beta$ of Example 5.20, we obtain, for $\lambda = 1$ and $\beta = 1.75$, the
numerical results $I_{11} = 1$, $I_{12} = 0.901272$, and $I_{22} = 1.625513$. Thus, for $n = 50$,
we have $AV\{\hat\lambda\} = 0.0246217$, $AV\{\hat\beta\} = 0.0275935$ and $AC(\hat\lambda, \hat\beta) = -0.0221655$.
The asymptotic standard errors (square roots of AV) of $\hat\lambda$ and $\hat\beta$ are 0.1569 and
0.1568, respectively. Thus, the estimates $\hat\lambda = 0.839$ and $\hat\beta = 1.875$ are not
significantly different from the true values $\lambda = 1$ and $\beta = 1.75$.


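Example 7.8. Let $X_1, \ldots, X_n$ be i.i.d. random variables with $X_1 \sim E\left(\frac{1}{\xi}\right)$, $0 <
\xi < \infty$. Let $Y_1, \ldots, Y_n$ be i.i.d. random variables, $Y_1 \sim G\left(\frac{1}{\eta}, 1\right)$, $0 < \eta < \infty$, and
assume that the Y-sample is independent of the X-sample.

The parameter to estimate is $\theta = \dfrac{\xi}{\eta}$. The MLE of $\theta$ is $\hat\theta_n = \dfrac{\bar X_n}{\bar Y_n}$, where $\bar X_n$ and
$\bar Y_n$ are the corresponding sample means. For each $n \ge 1$, $\hat\theta_n \sim \theta F[2n, 2n]$. The
asymptotic distribution of $\hat\theta_n$ is $N\left(\theta, \dfrac{2\theta^2}{n}\right)$, $0 < \theta < \infty$. $\hat\theta_n$ is a BAN estimator. To
find the asymptotic bias of $\hat\theta_n$, verify that

$$E\{\hat\theta_n\} = \theta\,\frac{n}{n-1} = \theta\left(1 + \frac{1}{n-1}\right).$$

The bias of the MLE is $B(\hat\theta_n) = \dfrac{\theta}{n-1}$, which is of $O\left(\dfrac{1}{n}\right)$. Thus, we adjust $\hat\theta_n$ by
$\hat\theta_n^* = \left(1 - \dfrac{1}{n}\right)\hat\theta_n$. The bias of $\hat\theta_n^*$ is $B(\hat\theta_n^*) = 0$. The variance of $\hat\theta_n^*$ is

$$V\{\hat\theta_n^*\} = \frac{2\theta^2}{n}\left(1 + \frac{3}{2n} + o\left(\frac{1}{n}\right)\right).$$

Thus, the second-order deficiency coefficient of $\hat\theta_n^*$ is $D = 3\theta^2$. Note that $\hat\theta_n^*$ is the
UMVU estimator of $\theta$.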


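Example 7.9. Let $X_1, X_2, \ldots, X_n$ be i.i.d. Poisson random variables, with mean $\lambda$,
$0 < \lambda < \infty$. We consider $\theta = e^{-\lambda}$, $0 < \theta < 1$.
The UMVU of $\theta$ is

$$\tilde\theta_n = \left(1 - \frac{1}{n}\right)^{T_n}, \quad n \ge 1,$$

where $T_n = \sum_{i=1}^{n} X_i$. The MLE of $\theta$ is

$$\hat\theta_n = \exp\{-\bar X_n\}, \quad n \ge 1,$$

where $\bar X_n = T_n/n$. Note that $\hat\theta_n - \tilde\theta_n \xrightarrow{\text{a.s.}} 0$. The two estimators are asymptotically
equivalent. Using moment generating functions, we prove that

$$E\{\hat\theta_n\} = \exp\{-n\lambda(1 - e^{-1/n})\} = e^{-\lambda} + \frac{\lambda}{2n}e^{-\lambda} + O\left(\frac{1}{n^2}\right).$$

Thus, adjusting the MLE for the bias, let

$$\hat\theta_n^* = e^{-\bar X_n} - \frac{\bar X_n}{2n}\exp\{-\bar X_n\}.$$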

n 


Note that, by the delta method, 


Te a eco ae 
nexp{—X,,}} = — —). 
: 2n 4n2~ ON 2 


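Thus,

$$E\{\hat\theta_n^*\} = e^{-\lambda} + O\left(\frac{1}{n^2}\right), \quad \text{as } n \to \infty.$$

The variance of the bias adjusted estimator $\hat\theta_n^*$ is

$$V\{\hat\theta_n^*\} = V\{e^{-\bar X_n}\} - \frac{1}{n}\text{cov}(e^{-\bar X_n}, \bar X_n e^{-\bar X_n}) + O\left(\frac{1}{n^3}\right), \quad \text{as } n \to \infty.$$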


Continuing the computations, we find 


Similarly, 


1 oe e314. — 2 1 
~cov(e*", X,e-*") = eat ) +o : 
n 2n? 


It follows that 


n n2 


Ke new 1 
Vie}= +o . 


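In the present example, the bias adjusted MLE is the most efficient second-order
estimator. The variance of the UMVU $\tilde\theta_n$ is

$$V\{\tilde\theta_n\} = \frac{\lambda e^{-2\lambda}}{n} + \frac{\lambda^2 e^{-2\lambda}}{2n^2} + O\left(\frac{1}{n^3}\right).$$

Thus, $\tilde\theta_n$ has the deficiency coefficient $D = \lambda^2 e^{-2\lambda}/2$.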
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Example 7.10. 


(a) 


(b) 


In n = 100 Bernoulli trials, we observe 56 successes. The model is X ~ 
B(100, 0). The MLE of 6 is 6 = 0.56 and that of g(0) = 2sin“!(./0) is 
g(0.56) = 1.69109. The 0.95-confidence limits for g(@) are gz = g(0.56) — 
Z.975/10 = 1.49509 and gy = 1.88709. The function g(@) varies in the range 
[0, 2]. Thus, let g¥ = max(0, g,) and g7, = min(gy, 7). In the present case, 
both gz, and gy are in (0, 7). The inverse transformation is 


6, = sin?(g,/2), 
Ou = sin’(gy /2). 
In the present case, 0; = 0.462 and 6y = 0.656. We can also, as mentioned 


earlier, determine the approximate confidence limits directly on 6 by estimat- 
ing the variance of @. In this case, we obtain the limits 


Both approaches yield here close results, since the sample is sufficiently large. 


(b) Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. vectors having the bivariate normal
distribution, with expectation vector $(\xi, \eta)$ and covariance matrix

$$\Sigma = \begin{pmatrix} \sigma^2 & \rho\sigma\tau \\ \rho\sigma\tau & \tau^2 \end{pmatrix}, \quad -\infty < \xi, \eta < \infty,\ 0 < \sigma, \tau < \infty,\ -1 < \rho < 1.$$

The MLE of $\rho$ is the sample coefficient of correlation $r = \sum(X_i - \bar X)(Y_i -
\bar Y)/[\sum(X_i - \bar X)^2 \cdot \sum(Y_i - \bar Y)^2]^{1/2}$. By determining the inverse of the Fisher
information matrix one obtains that the asymptotic variance of $r$ is $AV\{r\} =
\frac{1}{n}(1 - \rho^2)^2$. Thus, if we make the transformation $g(\rho) = \frac{1}{2}\log\dfrac{1+\rho}{1-\rho}$
then $g'(\rho) = \dfrac{1}{1 - \rho^2}$. Thus, $g(r) = \frac{1}{2}\log((1 + r)/(1 - r))$ is a variance stabilizing
transformation for $r$, with an asymptotic variance of $1/n$. Suppose that in a
sample of $n = 100$ we find a coefficient of correlation $r = 0.79$. Make the
transformation

$$g(0.79) = \frac{1}{2}\log\frac{1.79}{0.21} = 1.0714.$$

We obtain on the basis of this transformation the asymptotic limits

$$g_L = 1.0714 - 0.196 = 0.8754,$$
$$g_U = 1.0714 + 0.196 = 1.2674.$$

The inverse transformation is $\rho = (e^{2g} - 1)/(e^{2g} + 1)$. Thus, the confidence
interval of $\rho$ has the limits $\rho_L = 0.704$ and $\rho_U = 0.853$. On the other hand, if
we use the formula

$$r \pm \frac{z_{1-\alpha/2}}{\sqrt{n}}(1 - r^2),$$

we obtain the limits $\rho_L = 0.716$ and $\rho_U = 0.864$. The two methods yield
confidence intervals which are close, but not the same. A sample of size 100
is not large enough.
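The two intervals for the correlation coefficient can be computed side by side. This is a minimal sketch; the function names `fisher_z_interval` and `direct_interval` are illustrative assumptions, and the transformation $\frac{1}{2}\log\frac{1+r}{1-r}$ and its inverse are written with the standard library's `atanh` and `tanh`, which are the same functions.

```python
import math
from statistics import NormalDist


def fisher_z_interval(r, n, alpha=0.05):
    """Interval for rho via the variance stabilizing transformation atanh(r)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    g = math.atanh(r)            # 0.5 * log((1 + r) / (1 - r))
    half_width = z / math.sqrt(n)
    # tanh is the inverse transformation (e^{2g} - 1) / (e^{2g} + 1).
    return math.tanh(g - half_width), math.tanh(g + half_width)


def direct_interval(r, n, alpha=0.05):
    """Interval based on the estimated asymptotic variance (1 - r^2)^2 / n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * (1 - r**2) / math.sqrt(n)
    return r - half_width, r + half_width


print(fisher_z_interval(0.79, 100))
print(direct_interval(0.79, 100))
```

The transformed interval is asymmetric around $r$, reflecting the skewness of the sampling distribution of $r$ when $\rho$ is far from zero, while the direct interval is symmetric by construction.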


Example 7.11. In Example 6.6, we determined the confidence limits for the cross-product
ratio $\rho$. We develop here the large sample approximation, according to the
two approaches discussed above. Let

$$\omega = \log\rho = \log\frac{\theta_{11}}{1 - \theta_{11}} - \log\frac{\theta_{12}}{1 - \theta_{12}} - \log\frac{\theta_{21}}{1 - \theta_{21}} + \log\frac{\theta_{22}}{1 - \theta_{22}}.$$

Let $\hat\theta_{ij} = X_{ij}/n_{ij}$ $(i, j = 1, 2)$. $\hat\theta_{ij}$ is the MLE of $\theta_{ij}$. Let $\psi_{ij} = \log(\theta_{ij}/(1 - \theta_{ij}))$.
The MLE of $\psi_{ij}$ is $\hat\psi_{ij} = \log(\hat\theta_{ij}/(1 - \hat\theta_{ij}))$. The asymptotic distribution of $\hat\psi_{ij}$ is
normal with mean $\psi_{ij}$ and

$$AV\left\{\log\frac{\hat\theta_{ij}}{1 - \hat\theta_{ij}}\right\} = [n_{ij}\theta_{ij}(1 - \theta_{ij})]^{-1}.$$

Furthermore, the MLE of $\omega$ is

$$\hat\omega = \hat\psi_{11} - \hat\psi_{12} - \hat\psi_{21} + \hat\psi_{22}.$$

Since $X_{ij}$, $(i, j) = 1, 2$, are mutually independent so are the terms on the RHS of
$\hat\omega$. Accordingly, the asymptotic distribution of $\hat\omega$ is normal with expectation $\omega$ and
asymptotic variance

$$AV\{\hat\omega\} = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{1}{n_{ij}\theta_{ij}(1 - \theta_{ij})}.$$

Since the values $\theta_{ij}$ are unknown we substitute their MLEs. We thus define the
standard error of $\hat\omega$ as

$$SE\{\hat\omega\} = \left\{\sum_{i=1}^{2}\sum_{j=1}^{2}\frac{1}{n_{ij}\hat\theta_{ij}(1 - \hat\theta_{ij})}\right\}^{1/2}.$$

According to the asymptotic normal distribution of $\hat\omega$, the asymptotic confidence
limits for $\rho$ are

$$\hat\rho^{(1)} = \hat\rho\exp\{-z_{1-\alpha/2}\cdot SE\{\hat\omega\}\},$$

$$\hat\rho^{(2)} = \hat\rho\exp\{z_{1-\alpha/2}\,SE\{\hat\omega\}\},$$

where $\hat\rho$ is the MLE of $\rho = e^\omega$. These limits can be easily computed. For a numerical
example, consider the following table (Fleiss, 1973, p. 126) in which we present the
proportions of patients diagnosed as schizophrenic in two studies both performed in
New York and London.


            New York           London
Study       n      θ̂           n      θ̂
1         105    0.771       105    0.324
2         192    0.615       174    0.394

These samples yield the MLE $\hat\rho = 2.9$. The asymptotic confidence limits at level
$1 - \alpha = 0.95$ are $\hat\rho^{(1)} = 1.38$ and $\hat\rho^{(2)} = 6.08$. This result indicates that the interaction
parameter $\rho$ is significantly greater than 1. We show now the other approach, using
the variance stabilizing transformation $2\sin^{-1}(\sqrt{\theta})$. Let $\tilde\theta_{ij} = (X_{ij} + 0.5)/(n_{ij} + 1)$
and $Y_{ij} = 2\sin^{-1}(\sqrt{\tilde\theta_{ij}})$. On the basis of these variables, we set the $1 - \alpha$ confidence
limits for $\eta_{ij} = 2\sin^{-1}(\sqrt{\theta_{ij}})$. These are

$$\eta_{ij}^{(1)} = Y_{ij} - z_{1-\alpha/2}/\sqrt{n_{ij}} \quad \text{and} \quad \eta_{ij}^{(2)} = Y_{ij} + z_{1-\alpha/2}/\sqrt{n_{ij}}.$$

From these limits, we directly obtain the asymptotic confidence limits for $\psi_{ij}$ that are

$$\psi_{ij}^{(k)} = 2\log\tan(\tilde\eta_{ij}^{(k)}/2), \quad k = 1, 2,$$

where

$$\tilde\eta_{ij}^{(1)} = \max(0, \eta_{ij}^{(1)}), \quad i, j = 1, 2,$$

and

$$\tilde\eta_{ij}^{(2)} = \min(\pi, \eta_{ij}^{(2)}), \quad i, j = 1, 2.$$

We show now how to construct asymptotic confidence limits for $\rho$ from these
asymptotic confidence limits for $\psi_{ij}$.
Define

$$\tilde\omega = \frac{1}{2}\sum_{k=1}^{2}\left(\psi_{11}^{(k)} - \psi_{12}^{(k)} - \psi_{21}^{(k)} + \psi_{22}^{(k)}\right),$$

and

$$D = \frac{1}{4}\sum_{i=1}^{2}\sum_{j=1}^{2}\left(\psi_{ij}^{(2)} - \psi_{ij}^{(1)}\right)^2.$$

$D$ is approximately equal to $z_{1-\alpha/2}^2\,AV\{\hat\omega\}$. Indeed, from the asymptotic theory of
MLEs, $D_{ij} = (\psi_{ij}^{(2)} - \psi_{ij}^{(1)})^2/4$ is approximately the asymptotic variance of $\hat\psi_{ij}$ times
$z_{1-\alpha/2}^2$. Accordingly,

$$\sqrt{D} \approx z_{1-\alpha/2}\,SE\{\hat\omega\},$$

and by employing the normal approximation, the asymptotic confidence limits for $\rho$
are

$$\rho^{(k)} = \exp\{\tilde\omega + (-1)^k\sqrt{D}\}, \quad k = 1, 2.$$

Thus, we obtain the approximate confidence limits for Fleiss' example, $\rho^{(1)} = 1.40$
and $\rho^{(2)} = 6.25$. These limits are close to the ones obtained by the other approach.
For further details, see Zacks and Solomon (1976).
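The first approach (normal approximation on the log scale) can be sketched as follows. The cell layout is an assumption made for illustration: the four (sample size, proportion) pairs from the table are entered with signs $+, -, -, +$ in the order New York study 1, London study 1, New York study 2, London study 2. The function name `cross_product_ratio_interval` is hypothetical, and because the code does not round the point estimate, its limits differ slightly from the rounded values quoted in the text.

```python
import math
from statistics import NormalDist


def cross_product_ratio_interval(cells, alpha=0.05):
    """cells: list of (sign, n, theta_hat) with signs +1, -1, -1, +1.
    Returns (rho_hat, lower, upper) from the normal approximation to
    the log cross-product ratio."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # omega_hat is the signed sum of the estimated log-odds.
    omega_hat = sum(s * math.log(t / (1 - t)) for s, n, t in cells)
    # Estimated asymptotic variance: sum of 1 / (n * theta * (1 - theta)).
    se = math.sqrt(sum(1 / (n * t * (1 - t)) for s, n, t in cells))
    rho_hat = math.exp(omega_hat)
    return rho_hat, rho_hat * math.exp(-z * se), rho_hat * math.exp(z * se)


# Assumed cell ordering for the table of proportions.
cells = [(+1, 105, 0.771), (-1, 105, 0.324), (-1, 192, 0.615), (+1, 174, 0.394)]
rho_hat, lower, upper = cross_product_ratio_interval(cells)
print(round(rho_hat, 2), round(lower, 2), round(upper, 2))
```

Working on the log scale keeps both limits positive and makes the interval multiplicatively symmetric around the point estimate.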


Example 7.12. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having the gamma
distribution $G(1, \nu)$, $0 < \nu < \infty$. This is a one-parameter exponential type family,
with canonical p.d.f.

$$f(x; \nu) = \frac{1}{x}\exp\{\nu\log x - \log\Gamma(\nu)\}.$$

Here, $K(\nu) = \log\Gamma(\nu)$.
The MLE of $\nu$ is the root of the equation

$$K'(\nu) = \frac{\Gamma'(\nu)}{\Gamma(\nu)} = \bar U_n,$$

where $\bar U_n = \frac{1}{n}\sum_{i=1}^{n}\log(X_i)$. The function $\Gamma'(\nu)/\Gamma(\nu)$ is known as the di-gamma, or
psi function, $\psi(\nu)$ (see Abramowitz and Stegun, 1968, p. 259). $\psi(\nu)$ is tabulated for
$1 \le \nu \le 2$ in increments of $\Delta = 0.05$. For $\nu$ values smaller than 1 or greater than 2
use the recursive equation

$$\psi(1 + \nu) = \psi(\nu) + \frac{1}{\nu}.$$

The values of $\hat\nu_n$ can be determined by numerical interpolation.
The function $\psi(\nu)$ is analytic on the complex plane, excluding the points $\nu =
0, -1, -2, \ldots$. The $n$th order derivative of $\psi(\nu)$ is

$$\psi^{(n)}(\nu) = (-1)^{n+1}\int_0^\infty \frac{t^n e^{-\nu t}}{1 - e^{-t}}\,dt
= (-1)^{n+1}\,n!\sum_{j=0}^{\infty}\frac{1}{(j + \nu)^{n+1}}.$$

Accordingly,

$$K'(\nu) = \psi(\nu),$$
$$I(\nu) = \psi'(\nu),$$
$$\beta_1 = \frac{\psi^{(2)}(\nu)}{(\psi'(\nu))^{3/2}},$$
$$\beta_2 - 3 = \frac{\psi^{(3)}(\nu)}{(\psi'(\nu))^2}.$$

To assess the normal and the Edgeworth approximations to the distribution of $\hat\nu_n$, we
have simulated 1000 independent random samples of size $n = 20$ from the gamma
distribution with $\nu = 1$. In this case $I(1) = 1.64493$, $\beta_1 = -1.1395$ and $\beta_2 - 3 =
2.4$. In Table 7.1, we present some empirical quantiles of the simulations. We see that
the Edgeworth approximation is better than the normal for all standardized values of
$\hat\nu_n$ between the 0.2 and 0.8 quantiles. In the tails of the distribution, one could get
better results by the saddlepoint approximation.
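The constants quoted for $\nu = 1$ can be checked numerically from the series representation of the derivatives of the psi function, $(-1)^{m+1} m! \sum_{j \ge 0} (j + \nu)^{-(m+1)}$. The sketch below is illustrative; the function name `polygamma_series`, the number of terms, and the Euler-Maclaurin tail estimate are choices made here, not taken from the text.

```python
import math


def polygamma_series(m, nu, terms=100_000):
    """Derivative of order m >= 1 of the psi function at nu > 0, from the
    series (-1)^(m+1) * m! * sum_{j>=0} (j + nu)^-(m+1)."""
    partial = sum((j + nu) ** -(m + 1) for j in range(terms))
    a = terms + nu
    # Euler-Maclaurin estimate of the neglected tail of the series.
    tail = 1.0 / (m * a**m) + 0.5 / a ** (m + 1)
    return (-1) ** (m + 1) * math.factorial(m) * (partial + tail)


nu = 1.0
info = polygamma_series(1, nu)                          # I(nu) = psi'(nu)
beta1 = polygamma_series(2, nu) / info**1.5             # skewness coefficient
beta2_minus_3 = polygamma_series(3, nu) / info**2       # excess kurtosis
print(round(info, 5), round(beta1, 4), round(beta2_minus_3, 4))
```

At $\nu = 1$ the series reduce to multiples of zeta values, so the information equals $\pi^2/6$ and the excess kurtosis equals exactly 2.4.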
Table 7.1 Normal and Edgeworth
Approximations to the Distribution of
$\hat\nu_n$, $n = 20$, $\nu = 1$


Z Exact Normal Edgeworth 
—1.698 0.01 0.045 0.054 
—1.418 0.05 0.078 0.083 
—1.148 0.10 0.126 0.127 
—0.687 0.20 0.246 0.238 
—0.433 0.30 0.333 0.320 
—0.208 0.40 0.417 0.401 
—0.012 0.50 0.495 0.478 
0.306 0.60 0.620 0.606 
0.555 0.70 0.711 0.701 
0.887 0.80 0.812 0.811 
1.395 0.90 0.919 0.926 
1.932 0.95 0.973 0.981 
2.855 0.99 0.999 0.999 
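Example 7.13. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having the exponential
distribution $X_1 \sim E(\psi)$, $0 < \psi < \infty$. This is a one-parameter exponential family
with canonical p.d.f.

$$f(x; \psi) = \exp\{\psi U(x) - K(\psi)\}, \quad 0 < x < \infty,$$

where $U(x) = -x$ and $K(\psi) = -\log(\psi)$.
The MLE of $\psi$ is $\hat\psi_n = 1/\bar X_n$. The p.d.f. of $\hat\psi_n$ is obtained from the density of $\bar X_n$
and is

$$g_{\hat\psi_n}(x; \psi) = \frac{(n\psi)^n}{\Gamma(n)}\cdot\frac{1}{x^{n+1}}\,e^{-n\psi/x},$$

for $0 < x < \infty$.
The approximation to the p.d.f. according to (7.6.9) yields

$$\begin{aligned}
g_{\hat\psi_n}(x; \psi) &= c_n\,\frac{1}{x}\exp\{-(x - \psi)T_n - n(-\log\psi + \log x)\} \\
&= c_n\,\frac{\psi^n}{x^{n+1}}\exp\{n(x - \psi)/x\} \\
&= c_n e^n\,\frac{\psi^n}{x^{n+1}}\,e^{-n\psi/x}.
\end{aligned}$$

Substituting $c_n = n^n e^{-n}/\Gamma(n)$ we get the exact equation.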
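The agreement between the normalized saddlepoint formula and the exact density can be confirmed numerically. The sketch below assumes the saddlepoint form $c_n (K''(x))^{1/2}\exp\{-(x-\psi)T_n - n(K(\psi) - K(x))\}$ with $K(\psi) = -\log\psi$, $T_n = -n/x$ at $\hat\psi_n = x$, and $c_n = n^n e^{-n}/\Gamma(n)$; the function names are illustrative, and the computation is done on the log scale to avoid overflow.

```python
import math


def exact_density(x, psi, n):
    """Exact p.d.f. of the MLE 1 / X_bar for n exponential observations."""
    log_g = (n * math.log(n * psi) - math.lgamma(n)
             - (n + 1) * math.log(x) - n * psi / x)
    return math.exp(log_g)


def saddlepoint_density(x, psi, n):
    """Saddlepoint approximation with the normalizing constant c_n."""
    def k(v):
        return -math.log(v)           # cumulant function K

    t_n = -n / x                       # T_n = -n * X_bar, and X_bar = 1 / x
    log_c = n * math.log(n) - n - math.lgamma(n)
    # (K''(x))^(1/2) = 1 / x contributes -log(x).
    log_g = log_c - math.log(x) - (x - psi) * t_n - n * (k(psi) - k(x))
    return math.exp(log_g)


for x in (0.5, 1.0, 2.5):
    print(x, exact_density(x, 1.0, 20), saddlepoint_density(x, 1.0, 20))
```

The two columns coincide up to floating-point error, which is the sense in which the renormalized saddlepoint approximation is exact in this family.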
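Example 7.14. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables, having a common normal
distribution $N(\theta, 1)$. Consider the problem of testing the hypothesis $H_0: \theta \le 0$
against $H_1: \theta > 0$. We have seen that the uniformly most powerful (UMP) test of
size $\alpha$ is

$$\phi_n^0 = I\left\{\bar X_n \ge \frac{z_{1-\alpha}}{\sqrt{n}}\right\},$$

where $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $z_{1-\alpha} = \Phi^{-1}(1 - \alpha)$. The power function of this UMP
test is

$$\psi_n(\theta) = \Phi(\sqrt{n}\,\theta - z_{1-\alpha}), \quad \theta \ge 0.$$

Let $\theta_1 > 0$ be specified. The number of observations required so that $\psi_n(\theta_1) \ge \gamma$ is

$$N(\alpha, \theta_1, \gamma) = \text{least integer } n \text{ greater than } (z_\gamma + z_{1-\alpha})^2/\theta_1^2.$$

Note that

(i) $\lim_{n\to\infty}\psi_n(\theta_1) = 1$ for each $\theta_1 > 0$

and

(ii) if $\delta > 0$ and $\theta_1 = \dfrac{\delta}{\sqrt{n}}$ then

$$\lim_{n\to\infty}\psi_n\left(\frac{\delta}{\sqrt{n}}\right) = \Phi(\delta - z_{1-\alpha}) \equiv \psi_\infty,$$

where $0 < \alpha < \psi_\infty < 1$.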


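Suppose that one wishes to consider a more general model, in which the p.d.f. of
$X_1$ is $f(x - \theta)$, $-\infty < \theta < \infty$, where $f(x)$ is symmetric about 0 but not necessarily
equal to $\phi(x)$, and $V_\theta\{X\} = \sigma^2$ for all $-\infty < \theta < \infty$. We consider the hypotheses
$H_0: \theta \le 0$ against $H_1: \theta > 0$.

Due to the CLT, one can consider the sequence of test statistics

$$\phi_n^{(1)}(\mathbf{X}_n) = I\left\{\bar X_n \ge \frac{a_n\sigma}{\sqrt{n}}\right\},$$

where $a_n \downarrow z_{1-\alpha}$ as $n \to \infty$, and the alternative sequence

$$\phi_n^{(2)}(\mathbf{X}_n) = I\left\{M_n \ge \frac{a_n}{2f(0)\sqrt{n}}\right\},$$

where $M_n$ is the sample median

$$M_n = \begin{cases}
X_{(m+1)}, & \text{if } n = 2m + 1, \\
\frac{1}{2}(X_{(m)} + X_{(m+1)}), & \text{if } n = 2m.
\end{cases}$$

According to Theorem 1.13.7, $\sqrt{n}\,(M_n - \theta) \xrightarrow{d} N\left(0, \dfrac{1}{4f^2(0)}\right)$ as $n \to \infty$. Thus,
the asymptotic power functions of these tests are

$$\psi_n^{(1)}(\theta) \approx \Phi\left(\frac{\theta}{\sigma}\sqrt{n} - z_{1-\alpha}\right), \quad \theta > 0,$$

and

$$\psi_n^{(2)}(\theta) \approx \Phi\left(2\theta f(0)\sqrt{n} - z_{1-\alpha}\right), \quad \theta > 0.$$

Both $\psi_n^{(1)}(\theta_1)$ and $\psi_n^{(2)}(\theta_1)$ converge to 1 as $n \to \infty$, for any $\theta_1 > 0$, which shows their
consistency. We wish, however, to compare the behavior of the sequences of power
functions for $\theta_n = \dfrac{\delta}{\sqrt{n}}$. Note that each hypothesis, with $\theta_{1,n} = \dfrac{\delta}{\sqrt{n}}$, is an alternative
one. But, since $\theta_{1,n} \to 0$ as $n \to \infty$, these alternative hypotheses are called local
hypotheses. Here we get

$$\psi_n^{(1)}\left(\frac{\delta}{\sqrt{n}}\right) \approx \Phi\left(\frac{\delta}{\sigma} - z_{1-\alpha}\right) = \psi^*$$

and

$$\psi_n^{(2)}\left(\frac{\delta}{\sqrt{n}}\right) \approx \Phi\left(2f(0)\delta - z_{1-\alpha}\right) = \psi^{**}.$$

To insure that $\psi^* = \psi^{**}$ one has to consider for $\psi^{(2)}$ a sequence of alternatives $\dfrac{\delta}{\sqrt{n}}$
with sample size $n' = \dfrac{n}{4f^2(0)}$ so that

$$\psi_{n'}^{(2)}\left(\frac{\delta}{\sqrt{n}}\right) \approx \Phi\left(2f(0)\,\delta\sqrt{\frac{n'}{n}} - z_{1-\alpha}\right) = \psi^*.$$

The Pitman ARE of $\phi_n^{(2)}$ to $\phi_n^{(1)}$ is defined as the limit of $n/n'(n)$ as $n \to \infty$. In
the present example,

$$\text{ARE}(\phi^{(2)}, \phi^{(1)}) = 4f^2(0).$$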
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If the original model of X ~ N(@, 1), f(O) = 


1 
and ARE of to w is 0.637. 
V2 


1 
On the other hand, if f(x) = xe which is the Laplace distribution, then the ARE 
of ¥® to W™ is 1. a 


Example 7.15. In Example 7.3, we discussed the Wilcoxon signed-rank test of $H_0: \theta \le \theta_0$
versus $H_1: \theta > \theta_0$, when the distribution function $F$ is absolutely continuous
and symmetric around $\theta$. We show here Pitman's asymptotic efficiency of this
test relative to the t-test. The t-test is valid only in cases where $\sigma_f^2 = V\{X\}$ and
$0 < \sigma_f^2 < \infty$. The t-statistic is $t_n = \sqrt n\, \frac{\bar X_n - \theta_0}{S_n}$, where $S_n^2$ is the sample variance.
Since $S_n \xrightarrow{a.s.} \sigma_f$ as $n \to \infty$, we consider

$$\phi_n(t_n) = I\left\{\frac{\sqrt n (\bar X_n - \theta_0)}{S_n} \ge t_{1-\alpha}[n - 1]\right\}.$$

The asymptotic efficacy of the t-test is

$$J(\theta; t_n) = \frac{1}{\sigma_f^2},$$

where $\sigma_f^2$ is the variance of $X$ under the p.d.f. $f(x)$. Indeed, $\mu(\theta) = \theta$.
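Consider the Wilcoxon signed-rank statistic $T_n$, given by (7.1.3). The test function,
for large $n$, is given by (7.1.8). For this test

$$\mu(\theta) = P_\theta\{Y_1 > 0\} + \frac{(n - 1)}{2}\, p_2(\theta),$$

where $p_2(\theta)$ is given in Example 7.3. Thus,

$$\mu(\theta) = (1 - F(-\theta)) + \frac{(n - 1)}{2} \int_{-\infty}^{\infty} (1 - F(-x - \theta)) f(x - \theta)\,dx.$$

Hence,

$$\mu'(0) = f(0) + (n - 1) \int_{-\infty}^{\infty} f^2(x)\,dx.$$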


Using $\sigma^2(0) = \frac{(n + 1)(2n + 1)}{24}$, we obtain the asymptotic efficacy of $T_n$,

$$J(\theta; T_n) = 12 \left(\int_{-\infty}^{\infty} f^2(x)\,dx\right)^2 + O\left(\frac{1}{n}\right),$$

as $n \to \infty$. Thus, the Pitman ARE of $T_n$ versus $t_n$ is

$$ARE(T_n, t_n) = 12 \sigma_f^2 \left(\int_{-\infty}^{\infty} f^2(x)\,dx\right)^2. \qquad (7.9.16)$$

Thus, if $f(x) = \phi(x)$ (standard normal), $ARE(T_n, t_n) = 0.9549$. On the other hand, if
$f(x) = \frac{1}{2}\exp\{-|x|\}$ (standard Laplace), then $ARE(T_n, t_n) = 1.5$. These results show
the Wilcoxon signed-rank test to be an asymptotically very efficient nonparametric
test.
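
The two values quoted for (7.9.16) can be reproduced by numerical integration. The sketch below uses a plain midpoint rule from the standard library only; the helper names, the integration range $[-40, 40]$, and the step count are assumptions chosen for illustration.

```python
import math


def integrate(func, lo: float, hi: float, steps: int = 80_000) -> float:
    """Composite midpoint rule for the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def laplace_pdf(x: float) -> float:
    return 0.5 * math.exp(-abs(x))


def wilcoxon_vs_t_are(pdf, lo: float = -40.0, hi: float = 40.0) -> float:
    """12 * sigma_f^2 * (integral of f^2)^2 for a density symmetric about 0."""
    variance = integrate(lambda x: x * x * pdf(x), lo, hi)  # mean is 0 by symmetry
    integral_f_squared = integrate(lambda x: pdf(x) ** 2, lo, hi)
    return 12.0 * variance * integral_f_squared ** 2


are_normal = wilcoxon_vs_t_are(normal_pdf)    # approximately 3 / pi
are_laplace = wilcoxon_vs_t_are(laplace_pdf)  # approximately 1.5
print(round(are_normal, 4), round(are_laplace, 4))
```

For the standard normal density the integral of $f^2$ is $1/(2\sqrt{\pi})$, so the ARE equals $3/\pi$, which agrees with the value 0.9549 in the text.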


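Example 7.16. Let $X_1, \ldots, X_n$ be i.i.d. random variables, having a common Cauchy
distribution, with a location parameter $\theta$, $-\infty < \theta < \infty$, i.e.,

$$f(x; \theta) = \frac{1}{\pi (1 + (x - \theta)^2)}, \quad -\infty < x < \infty.$$

We derive a confidence interval for $\theta$, for large $n$ (asymptotic). Let $\hat\theta_n$ be the sample
median, i.e.,

$$\hat\theta_n = F_n^{-1}\left(\frac{1}{2}\right).$$

Note that, due to the symmetry of $f(x; \theta)$ around $\theta$, $\theta = F^{-1}\left(\frac{1}{2}\right)$. Moreover,

$$f(\theta; \theta) = \frac{1}{\pi}.$$

Hence, the $(1 - \alpha)$ confidence limits for $\theta$ are

$$\hat\theta_n \pm Z_{1-\alpha/2}\, \frac{\pi}{2\sqrt n}.$$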
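
The confidence limits of Example 7.16 are simple to compute. The following is a minimal sketch using the standard library only; the function name, the simulated data, and the seed are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, median


def cauchy_median_ci(sample, alpha: float = 0.05):
    """Asymptotic (1 - alpha) limits: median +/- Z_{1-alpha/2} * pi / (2 sqrt(n))."""
    n = len(sample)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.pi / (2.0 * math.sqrt(n))
    center = median(sample)
    return center - half_width, center + half_width


# Simulated Cauchy data with location theta = 3 (inverse-c.d.f. sampling).
rng = random.Random(0)
theta = 3.0
sample = [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(2000)]
print(cauchy_median_ci(sample))
```

The half-width depends on the data only through $n$, since $f(\theta; \theta) = 1/\pi$ does not depend on $\theta$.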
PART III: PROBLEMS 
Section 7.1 
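7.1.1 Let $X_i = \alpha + \beta z_i + \epsilon_i$, $i = 1, \ldots, n$, be a simple linear regression model,
where $z_1, \ldots, z_n$ are prescribed constants, and $\epsilon_1, \ldots, \epsilon_n$ are independent
random variables with $E\{\epsilon_i\} = 0$ and $V\{\epsilon_i\} = \sigma^2$, $0 < \sigma^2 < \infty$, for all $i = 1, \ldots, n$.
Let $\hat\alpha_n$ and $\hat\beta_n$ be the LSE of $\alpha$ and $\beta$.
(i) Show that if $\sum_{i=1}^n (z_i - \bar z_n)^2 \to \infty$, as $n \to \infty$, then $\hat\beta_n \xrightarrow{p} \beta$, i.e., $\hat\beta_n$ is
consistent.
(ii) What is a sufficient condition for the consistency of $\hat\alpha_n$?

7.1.2 Suppose that $X_1, X_2, \ldots, X_n, \ldots$ are i.i.d. random variables and $0 < E\{X_1^4\} < \infty$.
Give a strongly consistent estimator of the kurtosis coefficient

$$\beta_2 = \frac{\mu_4^*}{(\mu_2^*)^2}.$$

7.1.3 Let $X_1, \ldots, X_k$ be independent random variables having binomial distributions
$B(n, \theta_i)$, $i = 1, \ldots, k$. Consider the null hypothesis $H_0: \theta_1 = \cdots = \theta_k$
against the alternative $H_1: \sum_{i=1}^k (\theta_i - \bar\theta)^2 > 0$, where $\bar\theta = \frac{1}{k}\sum_{i=1}^k \theta_i$. Let $p_i = X_i/n$
and $\bar p = \frac{1}{k}\sum_{i=1}^k p_i$. Show that the test function

$$\phi(\mathbf{X}) = \begin{cases} 1, & \text{if } \dfrac{n}{\bar p(1 - \bar p)} \sum_{i=1}^k (p_i - \bar p)^2 > \chi^2_{1-\alpha}[k - 1], \\ 0, & \text{otherwise}, \end{cases}$$

has a size $\alpha_n$ converging to $\alpha$ as $n \to \infty$. Show that this test is consistent.

7.1.4 In continuation of Problem 3, define $Y_i = 2\sin^{-1}\sqrt{p_i}$, $i = 1, \ldots, k$.
(i) Show that the asymptotic distribution of $Y_i$, as $n \to \infty$, is $N\left(\eta_i, \frac{1}{n}\right)$, where
$\eta_i = 2\sin^{-1}\sqrt{\theta_i}$.
(ii) Show that $Q = n\sum_{i=1}^k (Y_i - \bar Y)^2$, where $\bar Y = \frac{1}{k}\sum_{i=1}^k Y_i$, is distributed asymptotically
(as $n \to \infty$) like $\chi^2[k - 1; \lambda_n(\theta)]$, where $\lambda_n(\theta) = \frac{n}{2}\sum_{i=1}^k (\eta_i - \bar\eta)^2$;
$\bar\eta = \frac{1}{k}\sum_{i=1}^k \eta_i$. Thus, prove the consistency of the test.
(iii) Derive the formula for computing the asymptotic power of the test $\phi(\mathbf{X}) = I\{Q \ge \chi^2_{1-\alpha}[k - 1]\}$.
(iv) Assuming that $\sum_{i=1}^k (\eta_i - \bar\eta)^2$ is independent of $n$, how large should $n$ be so
that the probability of rejecting $H_0$ when $\sum_{i=1}^k (\eta_i - \bar\eta)^2 = 10^{-1}$ will not be
smaller than 0.9?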
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Section 7.2 


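7.2.1 Let $X_1, X_2, \ldots, X_n, \ldots$ be i.i.d. random variables, $X_i \sim G(1, \nu)$, $0 < \nu \le \nu^* < \infty$.
Show that all conditions of Theorem 7.2.1 are satisfied, and hence
the MLE, $\hat\nu_n \xrightarrow{a.s.} \nu$ as $n \to \infty$ (strongly consistent).

7.2.2 Let $X_1, X_2, \ldots, X_n, \ldots$ be i.i.d. random variables, $X_i \sim \beta(\nu, 1)$, $0 < \nu < \infty$.
Show that the MLE, $\hat\nu_n$, is strongly consistent.

7.2.3 Consider the Hardy-Weinberg genetic model, in which $(J_1, J_2) \sim MN(n, (p_1(\theta), p_2(\theta)))$,
where $p_1(\theta) = \theta^2$ and $p_2(\theta) = 2\theta(1 - \theta)$, $0 < \theta < 1$.
Show that the MLE of $\theta$, $\hat\theta_n$, is strongly consistent.

7.2.4 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables from $G(\lambda, 1)$, $0 < \lambda < \infty$.
Show that the following estimators $\hat\omega(\bar X_n)$ are consistent estimators of $\omega(\lambda)$:
(i) $\hat\omega(\bar X_n) = -\log \bar X_n$, $\omega(\lambda) = \log \lambda$;
(ii) $\hat\omega(\bar X_n) = \bar X_n^2$, $\omega(\lambda) = 1/\lambda^2$;
(iii) $\hat\omega(\bar X_n) = \exp\{-1/\bar X_n\}$, $\omega(\lambda) = \exp\{-\lambda\}$.

7.2.5 Let $X_1, \ldots, X_n$ be i.i.d. from $N(\mu, \sigma^2)$, $-\infty < \mu < \infty$, $0 < \sigma < \infty$. Show
that
(i) $\log(1 + \bar X_n^2)$ is a consistent estimator of $\log(1 + \mu^2)$;
(ii) $\Phi(\bar X_n/S)$ is a consistent estimator of $\Phi(\mu/\sigma)$, where $S^2$ is the sample
variance.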
Section 7.3 
7.3.1 Let $(X_i, Y_i)$, $i = 1, \ldots, n$ be i.i.d. random vectors, where

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left(\begin{pmatrix} \xi \\ \eta \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right),$$

$-\infty < \xi < \infty$, $0 < \eta < \infty$, $0 < \sigma_1, \sigma_2 < \infty$, $-1 < \rho < 1$. Find the asymptotic
distribution of $W_n = \bar X_n/\bar Y_n$, where $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ and $\bar Y_n = \frac{1}{n}\sum_{i=1}^n Y_i$.

7.3.2 Let $X_1, X_2, \ldots, X_n, \ldots$ be i.i.d. random variables having a Cauchy distribution
with location parameter $\theta$, i.e.,

$$f(x; \theta) = \frac{1}{\pi} \cdot \frac{1}{1 + (x - \theta)^2}, \quad -\infty < x < \infty, \quad -\infty < \theta < \infty.$$

Let $M_n$ be the sample median, or $M_n = F_n^{-1}\left(\frac{1}{2}\right)$. Is $M_n$ a BAN estimator?

7.3.3 Derive the asymptotic variances of the MLEs of Problems 1-3 of Section 5.6
and compare the results with the large sample approximations of Problem 4
of Section 5.6.

7.3.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables. The distribution of
$X_1$ is that of $N(\mu, \sigma^2)$. Derive the asymptotic variance of the MLE of
$\Phi(\mu/\sigma)$.

7.3.5 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a log-normal distribution
$LN(\mu, \sigma^2)$. What is the asymptotic covariance matrix of the MLE of $\xi = \exp\{\mu + \sigma^2/2\}$
and $D^2 = \xi^2 (\exp\{\sigma^2\} - 1)$?

Section 7.4 


7.4.1 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a normal distribution
$N(\mu, \sigma^2)$, $-\infty < \mu < \infty$, $0 < \sigma < \infty$. Let $\theta = e^{\mu}$.
(i) What is the bias of the MLE $\hat\theta_n$?
(ii) Let $\tilde\theta_n$ be the bias adjusted MLE. What is $\tilde\theta_n$ and what is the order of its
bias, in terms of $n$?
(iii) What is the second order deficiency coefficient of $\tilde\theta_n$?

7.4.2 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables, $X_i \sim G\left(\frac{1}{\beta}, 1\right)$, $0 < \beta < \infty$.
Let $\theta = e^{-1/\beta}$, $0 < \theta < 1$.
(i) What is the MLE of $\theta$?
(ii) Use the delta method to find the bias of the MLE, $\hat\theta_n$, up to $O\left(\frac{1}{n}\right)$.
(iii) What is the second-order deficiency coefficient of the bias adjusted
MLE?

7.4.3 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a one-parameter canonical
exponential type p.d.f. Show that the first order bias term of the MLE $\hat\psi_n$ is

$$B_n(\psi) = -\frac{1}{2n} \cdot \frac{K^{(3)}(\psi)}{I^2(\psi)}.$$
Section 7.5 
7.5.1 In a random sample of size $n = 50$ of random vectors $(X, Y)$ from a bivariate
normal distribution, $-\infty < \mu, \eta < \infty$, $0 < \sigma_1, \sigma_2 < \infty$, $-1 < \rho < 1$, the
MLE of $\rho$ is $\hat\rho = 0.85$. Apply the variance stabilizing transformation to
determine asymptotic confidence limits of $\phi = \sin^{-1}(\rho)$; $-\frac{\pi}{2} < \phi < \frac{\pi}{2}$.

7.5.2 Let $S_n^2$ be the sample variance in a random sample from a normal distribution
$N(\mu, \sigma^2)$. Show that the asymptotic variance of

$$W_n = \frac{1}{\sqrt 2}\log(S_n^2) \quad \text{is} \quad AV\{W_n\} = \frac{1}{n}.$$

Suppose that $n = 250$ and $S_n^2 = 17.39$. Apply the above transformation to
determine asymptotic confidence limits, at level $1 - \alpha = 0.95$, for $\sigma^2$.

7.5.3 Let $X_1, \ldots, X_n$ be a random sample (i.i.d.) from $N(\mu, \sigma^2)$; $-\infty < \mu < \infty$,
$0 < \sigma^2 < \infty$.
(i) Show that the asymptotic variance of the MLE of $\sigma$ is $\sigma^2/2n$.
(ii) Determine asymptotic confidence intervals at level $(1 - \alpha)$ for $\omega = \mu + Z_\gamma \sigma$.
(iii) Determine asymptotic confidence intervals at level $(1 - \alpha)$ for $\mu/\sigma$ and
for $\Phi(\mu/\sigma)$.

7.5.4 Let $X_1, \ldots, X_n$ be a random sample from a location parameter Laplace
distribution; $-\infty < \mu < \infty$. Determine a $(1 - \alpha)$-level asymptotic confidence
interval for $\mu$.


Section 7.6 


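7.6.1 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a one-parameter
Beta$(\nu, \nu)$ distribution.
(i) Write the common p.d.f. $f(x; \nu)$ in a canonical exponential type form.
(ii) What is the MLE, $\hat\nu_n$?
(iii) Write the Edgeworth expansion approximation to the distribution of the
MLE $\hat\nu_n$.

7.6.2 In continuation of the previous problem, derive the $p^*$-formula of the density
of the MLE, $\hat\nu_n$.

PART IV: SOLUTIONS OF SELECTED PROBLEMS

7.1.3 Let $\theta' = (\theta_1, \ldots, \theta_k)$ and $p' = (p_1, \ldots, p_k)$. Let $D = \text{diag}(\theta_i(1 - \theta_i),\ i = 1, \ldots, k)$
be a $k \times k$ diagonal matrix. Generally, we denote $X_n \sim AN\left(\xi, \frac{1}{n}V\right)$ if
$\sqrt n (X_n - \xi) \xrightarrow{d} N(0, V)$ as $n \to \infty$. [$AN(\cdot, \cdot)$ stands for
'asymptotically normal']. In the present case, $p \sim AN\left(\theta, \frac{1}{n}D\right)$.
Let $H_0: \theta_1 = \theta_2 = \cdots = \theta_k = \theta$. Then, if $H_0$ is true,

$$p \sim AN\left(\theta 1_k, \frac{\theta(1 - \theta)}{n} I_k\right).$$

Now, $\sum_{i=1}^k (p_i - \bar p_k)^2 = p'\left(I_k - \frac{1}{k}J_k\right)p$, where $J_k = 1_k 1_k'$. Since
$\left(I_k - \frac{1}{k}J_k\right)$ is idempotent, of rank $(k - 1)$,

$$n\, p'\left(I_k - \frac{1}{k}J_k\right)p \xrightarrow[n\to\infty]{d} \theta(1 - \theta)\chi^2[k - 1].$$

Moreover, $\bar p_k = \frac{1}{k}\sum_{i=1}^k p_i \to \theta$ a.s., as $n \to \infty$. Thus, by
Slutsky's Theorem

$$\frac{n\sum_{i=1}^k (p_i - \bar p_k)^2}{\bar p_k(1 - \bar p_k)} \xrightarrow{d} \chi^2[k - 1],$$

and

$$\lim_{n\to\infty} P\left\{\frac{n\sum_{i=1}^k (p_i - \bar p_k)^2}{\bar p_k(1 - \bar p_k)} \ge \chi^2_{1-\alpha}[k - 1]\right\} = \alpha.$$

If $H_0$ is not true, $\sum_{i=1}^k (\theta_i - \bar\theta_k)^2 > 0$. Also,

$$\lim_{n\to\infty} \sum_{i=1}^k (p_i - \bar p_k)^2 = \sum_{i=1}^k (\theta_i - \bar\theta_k)^2 > 0 \quad \text{a.s.}$$

Thus, under $H_1$,

$$\lim_{n\to\infty} P\left\{\frac{n\sum_{i=1}^k (p_i - \bar p_k)^2}{\bar p_k(1 - \bar p_k)} \ge \chi^2_{1-\alpha}[k - 1]\right\} = 1.$$

Thus, the test is consistent.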


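7.2.3 The MLE of $\theta$ is $\hat\theta_n = \frac{2J_1 + J_2}{2n}$. $J_1 \sim B(n, \theta^2)$. Hence, $\frac{J_1}{n} \xrightarrow{a.s.} \theta^2$ and
$\frac{J_2}{n} \xrightarrow{a.s.} 2\theta(1 - \theta)$. Thus, $\lim_{n\to\infty} \hat\theta_n = \theta$ a.s.

7.3.1 By SLLN, $\lim_{n\to\infty} \frac{\bar X_n}{\bar Y_n} = \frac{\xi}{\eta}$ a.s. By the delta method,

$$\sqrt n\left(W_n - \frac{\xi}{\eta}\right) \xrightarrow{d} N(0, D^2),$$

where

$$D^2 = \frac{\sigma_1^2}{\eta^2} - 2\rho\sigma_1\sigma_2\frac{\xi}{\eta^3} + \sigma_2^2\frac{\xi^2}{\eta^4}.$$

7.3.2 As shown in Example 7.16,

$$f(x; \theta) = \frac{1}{\pi(1 + (x - \theta)^2)}, \quad -\infty < x < \infty.$$

Also, $f(\theta; \theta) = \frac{1}{\pi}$, so that the asymptotic variance of the sample median $\hat\theta_n$ is
$\frac{1}{4n f^2(\theta; \theta)} = \frac{\pi^2}{4n}$.
On the other hand, the Fisher information is $I(\theta) = \frac{1}{2}$. Thus,

$$AV(\hat\theta_n) = \frac{\pi^2}{4n} > \frac{1}{n I(\theta)} = \frac{2}{n}.$$

Thus, $\hat\theta_n$ is not a BAN estimator.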
n 


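7.4.2

(i) $X \sim G\left(\frac{1}{\beta}, 1\right)$. Hence, the MLE of $\beta$ is $\bar X_n$. It follows that the MLE of
$\theta = e^{-1/\beta}$ is $\hat\theta_n = e^{-1/\bar X_n}$.

(ii)

$$\hat\theta_n - \theta = (\bar X_n - \beta)\frac{e^{-1/\beta}}{\beta^2} + \frac{1}{2}(\bar X_n - \beta)^2\, \frac{e^{-1/\beta}}{\beta^3}\left(\frac{1}{\beta} - 2\right) + o_p\left(\frac{1}{n}\right).$$

Hence,

$$\text{Bias}(\hat\theta_n) = E\{\hat\theta_n - \theta\} = \frac{\beta^2 e^{-1/\beta}}{2n\beta^3}\left(\frac{1}{\beta} - 2\right) + o\left(\frac{1}{n}\right) = \frac{e^{-1/\beta}}{2n\beta}\left(\frac{1}{\beta} - 2\right) + o\left(\frac{1}{n}\right).$$

The bias adjusted estimator is

$$\tilde\theta_n = e^{-1/\bar X_n}\left(1 + \frac{1}{n\bar X_n} - \frac{1}{2n\bar X_n^2}\right).$$

(iii) Let $f(x) = e^{-1/x}\left(1 + \frac{1}{nx} - \frac{1}{2nx^2}\right)$. Then

$$f'(x) = \frac{1}{x^2}e^{-1/x}\left(1 + \frac{1}{nx} - \frac{1}{2nx^2}\right) + e^{-1/x}\left(-\frac{1}{nx^2} + \frac{1}{nx^3}\right)
= \frac{e^{-1/x}}{x^2}\left(1 - \frac{1}{n} + \frac{2}{nx} - \frac{1}{2nx^2}\right).$$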


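It follows that

$$AV\{\tilde\theta_n\} = \frac{e^{-2/\beta}}{n\beta^2} - \frac{2e^{-2/\beta}}{n^2\beta^2}\left(1 - \frac{2}{\beta} + \frac{1}{2\beta^2}\right) + o\left(\frac{1}{n^2}\right).$$

Accordingly, the second-order deficiency coefficient of $\tilde\theta_n$ is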


pin eee: yo 
7 Z| ah 
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7.6.1 Xj ,..., X, are i.i.d. like Beta(v, v),0 < v < oc. 
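(i)

$$f(x; \nu) = \frac{1}{B(\nu, \nu)}\, x^{\nu - 1}(1 - x)^{\nu - 1}, \quad 0 < x < 1,$$

$$= \frac{1}{B(\nu, \nu)\, x(1 - x)}\, e^{\nu \log(x(1 - x))}$$

$$= \frac{1}{x(1 - x)} \exp\{\nu \log(x(1 - x)) - K(\nu)\},$$

where $K(\nu) = \log B(\nu, \nu)$.
(ii) The likelihood function is equivalent to

$$L(\nu \mid \mathbf{X}) = \exp\left(\nu \sum_{i=1}^n \log(X_i(1 - X_i)) - nK(\nu)\right).$$

The log likelihood is

$$l(\nu \mid \mathbf{X}) = \nu \sum_{i=1}^n \log(X_i(1 - X_i)) - nK(\nu).$$

Note that the derivative of $K(\nu)$ is

$$K'(\nu) = 2\left(\frac{\Gamma'(\nu)}{\Gamma(\nu)} - \frac{\Gamma'(2\nu)}{\Gamma(2\nu)}\right).$$

It follows that the MLE of $\nu$ is the root of

$$\frac{\Gamma'(\nu)}{\Gamma(\nu)} - \frac{\Gamma'(2\nu)}{\Gamma(2\nu)} = \frac{1}{2n}\sum_{i=1}^n \log(X_i(1 - X_i)).$$

The function $\frac{d}{d\nu}\log\Gamma(\nu)$ is also called the psi function,
i.e., $\psi(\nu) = \frac{d}{d\nu}\log\Gamma(\nu)$. As shown in Abramowitz and Stegun
(1968, p. 259), $\psi(2\nu) - \psi(\nu) = \frac{1}{2}\left(\psi\left(\nu + \frac{1}{2}\right) - \psi(\nu)\right) + \log 2$. Also,
$-\frac{1}{n}\sum_{i=1}^n \log(X_i(1 - X_i)) \ge \log 4$. Thus, the MLE is the value of $\nu$ for
which

$$\psi\left(\nu + \frac{1}{2}\right) - \psi(\nu) = -\frac{1}{n}\sum_{i=1}^n \log(X_i(1 - X_i)) - \log 4.$$

(iii) Since Beta$(\nu, \nu)$ is a symmetric distribution around $x = \frac{1}{2}$, $X \sim (1 - X)$
and hence

$$V\left\{\sum_{i=1}^n \log(X_i(1 - X_i))\right\} = 4nV\{\log X\}.$$

Thus, since $X_1, \ldots, X_n$ are i.i.d., the Fisher information is $I(\nu) = 4V\{\log X\}$.
The first four central moments of Beta$(\nu, \nu)$ are $\mu_1^* = 0$;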
2v? +8 5 
wy = ——~; 23 = Oand py = ea a Thus, 
4(2v + 1) 16(2v + 1) 
B; =0, 

and 


Bo = (2v? + 8v +5)(2v + 1). 


It follows that the Edgeworth asymptotic approximation to the distribu- 
tion of the MLE, j,,, is 


Be 2 
P{/nl(0) On — v) < x) & x) — 2(- +8v+5 ) 


24n 16(2v + 1) 


CHAPTER 8 


Bayesian Analysis in Testing 
and Estimation 


PART I: THEORY 


This chapter is devoted to some topics of estimation and testing hypotheses from the 
point of view of statistical decision theory. The decision theoretic approach provides 
a general framework for both estimation of parameters and testing hypotheses. The 
objective is to study classes of procedures in terms of certain associated risk functions 
and determine the existence of optimal procedures. The results that we have presented 
in the previous chapters on minimum mean-squared-error (MSE) estimators and on 
most powerful tests can be considered as part of the general statistical decision theory. 
We have seen that uniformly minimum MSE estimators and uniformly most powerful 
tests exist only in special cases. One could overcome this difficulty by considering 
procedures that yield minimum average risk, where the risk is defined as the expected 
loss due to erroneous decision, according to the particular distribution $F_\theta$. The MSE
in estimation and the error probabilities in testing are special risk functions. The
risk functions depend on the parameters $\theta$ of the parent distribution. The average
risk can be defined as an expected risk according to some probability distribution on
the parameter space. Statistical inference that considers the parameter(s) as random
variables is called a Bayesian inference. The expected risk with respect to the
distribution of $\theta$ is called in Bayesian theory the prior risk, and the probability
measure on the parameter space is called a prior distribution. The estimators or 
test functions that minimize the prior risk, with respect to some prior distribution, 
are called Bayes procedures for the specified prior distribution. Bayes procedures 
have certain desirable properties. This chapter is devoted, therefore, to the study of 
the structure of optimal decision rules in the framework of Bayesian theory. We start
Section 8.1 with a general discussion of the basic Bayesian tools and information
functions. We outline the decision theory and provide an example of an optimal
statistical decision procedure. In Section 8.2, we discuss testing of hypotheses from
the Bayesian point of view, and in Section 8.3, we present Bayes credibility intervals.
The Bayesian theory of point estimation is discussed in Section 8.4. Section 8.5
discusses analytical and numerical techniques for evaluating posterior distributions
in complex cases. Section 8.6 is devoted to empirical Bayes procedures.


8.1 THE BAYESIAN FRAMEWORK 


8.1.1 Prior, Posterior, and Predictive Distributions 


In the previous chapters, we discussed problems of statistical inference, testing
hypotheses, and estimation, considering the parameters of the statistical models as
fixed unknown constants. This is the so-called classical approach to the problems
of statistical inference. In the Bayesian approach, the unknown parameters are
considered as values determined at random according to some specified distribution,
called the prior distribution. This prior distribution can be conceived as a normalized
nonnegative weight function that the statistician assigns to the various possible
parameter values. It can express his degree of belief in the various parameter values
or the amount of prior information available on the parameters. For the philosophical
foundations of the Bayesian theory, see the books of DeFinetti (1974), Barnett
(1973), Hacking (1965), Savage (1962), and Schervish (1995). We discuss here only
the basic mathematical structure.

Let $\mathcal{F} = \{F(x; \theta); \theta \in \Theta\}$ be a family of distribution functions specified by the
statistical model. The parameters $\theta$ of the elements of $\mathcal{F}$ are real or vector valued
parameters. The parameter space $\Theta$ is specified by the model. Let $\mathcal{H}$ be a family of
distribution functions defined on the parameter space $\Theta$. The statistician chooses an
element $H(\theta)$ of $\mathcal{H}$ and assigns it the role of a prior distribution. The actual parameter
value $\theta_0$ of the distribution of the observable random variable $X$ is considered to be
a realization of a random variable having the distribution $H(\theta)$. After observing the
value of $X$ the statistician adjusts his prior information on the value of the parameter
$\theta$ by converting $H(\theta)$ to the posterior distribution $H(\theta \mid X)$. This is done by Bayes
Theorem according to which if $h(\theta)$ is the prior probability density function (p.d.f.)
of $\theta$ and $f(x; \theta)$ the p.d.f. of $X$ under $\theta$, then the posterior p.d.f. of $\theta$ is

$$h(\theta \mid x) = h(\theta) f(x; \theta) \Big/ \int_\Theta f(x; \theta)\,dH(\theta). \qquad (8.1.1)$$
If we are given a sample of $n$ observations or random variables $X_1, X_2, \ldots, X_n$,
whose distributions belong to a family $\mathcal{F}$, the question is whether these random
variables are independent identically distributed (i.i.d.) given $\theta$, or whether $\theta$ might
be randomly chosen from $H(\theta)$ for each observation.

At the beginning, we study the case that $X_1, \ldots, X_n$ are conditionally i.i.d., given $\theta$.
This is the classical Bayesian model. In Section 8.6, we study the so-called empirical
Bayes model, in which $\theta$ is randomly chosen from $H(\theta)$ for each observation. In
the classical model, if the family $\mathcal{F}$ admits a sufficient statistic $T(\mathbf{X})$, then for any
prior distribution $H(\theta)$, the posterior distribution is a function of $T(\mathbf{X})$, and can be
determined from the distribution of $T(\mathbf{X})$ under $\theta$. Indeed, by the Neyman-Fisher
Factorization Theorem, if $T(\mathbf{X})$ is sufficient for $\mathcal{F}$ then $f(\mathbf{x}; \theta) = k(\mathbf{x}) g(T(\mathbf{x}); \theta)$.
Hence,

$$h(\theta \mid \mathbf{x}) = h(\theta) g(T(\mathbf{x}); \theta) \Big/ \int_\Theta g(T(\mathbf{x}); \theta)\,dH(\theta). \qquad (8.1.2)$$

Thus, the posterior p.d.f. is a function of $T(\mathbf{X})$. Moreover, the p.d.f. of $T(\mathbf{X})$ is
$g^*(t; \theta) = k^*(t) g(t; \theta)$, where $k^*(t)$ is independent of $\theta$. It follows that the conditional
p.d.f. of $\theta$ given $\{T(\mathbf{X}) = t\}$ coincides with $h(\theta \mid \mathbf{x})$ on the sets $\{\mathbf{x}; T(\mathbf{x}) = t\}$ for
all $t$.
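
The dependence of the posterior on the data only through a sufficient statistic can be seen in a small numerical sketch. The Bernoulli model, the uniform prior, the grid approximation of (8.1.1), and all names below are assumptions made for illustration; they are not taken from the text.

```python
def grid_posterior(xs, prior, grid):
    """Posterior weights on a grid of theta values for i.i.d. Bernoulli(theta) data.

    Implements h(theta | x) proportional to h(theta) * f(x; theta), normalized over the grid.
    """
    weights = []
    for theta in grid:
        likelihood = 1.0
        for x in xs:
            likelihood *= theta if x == 1 else 1.0 - theta
        weights.append(prior(theta) * likelihood)
    total = sum(weights)
    return [w / total for w in weights]


grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints of 1000 cells in (0, 1)


def uniform_prior(theta):
    return 1.0


# Two different samples with the same sufficient statistic T = sum(x) = 3.
posterior_a = grid_posterior([1, 1, 0, 0, 1], uniform_prior, grid)
posterior_b = grid_posterior([0, 1, 1, 1, 0], uniform_prior, grid)

posterior_mean = sum(t * w for t, w in zip(grid, posterior_a))
print(posterior_mean)  # close to 4/7, the mean of a Beta(4, 3) distribution
```

Both samples give the same posterior up to rounding, since the likelihood depends on the sample only through the number of successes.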

Bayes predictive distributions are the marginal distributions of the observed
random variables, according to the model. More specifically, if a random vector $\mathbf{X}$
has a joint distribution $F(\mathbf{x}; \theta)$ and the prior distribution of $\theta$ is $H(\theta)$ then the joint
predictive distribution of $\mathbf{X}$ under $H$ is

$$F_H(\mathbf{x}) = \int_\Theta F(\mathbf{x}; \theta)\,dH(\theta). \qquad (8.1.3)$$

A most important question in Bayesian analysis is what prior distribution to choose. 
The answer is, generally, that the prior distribution should reflect possible prior 
knowledge available on possible values of the parameter. In many situations, the 
prior information on the parameters is vague. In such cases, we may use formal prior 
distributions, which are discussed in Section 8.1.2. On the other hand, in certain
scientific or technological experiments much is known about possible values of the 
parameters. This may guide in selecting a prior distribution, as illustrated in the 
examples. 

There are many examples of posterior distributions that belong to the same parametric
family as the prior distribution. Generally, if the family of prior distributions
$\mathcal{H}$ relative to a specific family $\mathcal{F}$ yields posteriors in $\mathcal{H}$, we say that $\mathcal{F}$ and $\mathcal{H}$ are
conjugate families. For more discussion on conjugate prior distributions, see Raiffa
and Schlaifer (1961). In Example 8.2, we illustrate a few conjugate prior families.

The situation when conjugate prior structure exists is relatively simple and gener- 
ally leads to analytic expression of the posterior distribution. In research, however, 
we often encounter much more difficult problems, as illustrated in Example 8.3. In 
such cases, we often cannot express the posterior distribution in analytic form, and
have to resort to the numerical evaluations discussed in Section 8.5.


8.1.2 Noninformative and Improper Prior Distributions 


It is sometimes tempting to obtain posterior densities by multiplying the likelihood
function by a function $h(\theta)$, which is not a proper p.d.f. For example, suppose that
$X \mid \theta \sim N(\theta, 1)$. In this case $L(\theta; X) = \exp\left\{-\frac{1}{2}(\theta - X)^2\right\}$. This likelihood function
is integrable with respect to $d\theta$. Indeed,

$$\int_{-\infty}^{\infty} \exp\left\{-\frac{1}{2}(\theta - X)^2\right\} d\theta = \sqrt{2\pi}.$$

Thus, if we consider formally the function $h(\theta)d\theta = c\,d\theta$ or $h(\theta) = c$ then

$$h(\theta \mid X) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}(\theta - X)^2\right\}, \qquad (8.1.4)$$

which is the p.d.f. of $N(X, 1)$. The function $h(\theta) = c$, $c > 0$ for all $\theta$ is called an
improper prior density since $\int_{-\infty}^{\infty} c\,d\theta = \infty$. Another example is when $X \mid \lambda \sim P(\lambda)$,
i.e., $L(\lambda \mid X) = e^{-\lambda}\lambda^X$. If we use the improper prior density $h(\lambda) = c > 0$ for
all $\lambda > 0$ then the posterior p.d.f. is

$$h(\lambda \mid X) = \frac{1}{X!}\lambda^X e^{-\lambda}, \quad 0 < \lambda < \infty. \qquad (8.1.5)$$

This is a proper p.d.f. of $G(1, X + 1)$ despite the fact that $h(\lambda)$ is an improper prior
density. Some people justify the use of an improper prior by arguing that it provides
a "diffused" prior, yielding an equal weight to all points in the parameter space. For
example, the improper priors that lead to the proper posterior densities (8.1.4) and
(8.1.5) may reflect a state of ignorance, in which all points $\theta$ in $(-\infty, \infty)$ or $\lambda$ in
$(0, \infty)$ are "equally" likely.
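
That the flat prior on $\lambda$ yields a proper posterior can be verified numerically. The sketch below integrates the kernel $e^{-\lambda}\lambda^X$ for an assumed observation $X = 4$; the midpoint rule, the truncation of the range at 60, and the variable names are illustrative choices.

```python
import math


def integrate(func, lo: float, hi: float, steps: int = 60_000) -> float:
    """Composite midpoint rule for the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))


x_observed = 4  # assumed observed Poisson count


def kernel(lam: float) -> float:
    """Likelihood times the flat improper prior h(lambda) = 1."""
    return math.exp(-lam) * lam ** x_observed


# The upper limit 60 stands in for infinity; the neglected tail is negligible here.
normalizer = integrate(kernel, 0.0, 60.0)
posterior_mean = integrate(lambda lam: lam * kernel(lam), 0.0, 60.0) / normalizer

print(normalizer, posterior_mean)  # approximately X! = 24 and X + 1 = 5
```

The finite normalizer $X!$ is what makes (8.1.5) a proper density, and the posterior mean $X + 1$ agrees with the $G(1, X + 1)$ form.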

Lindley (1956) defines a prior density $h(\theta)$ to be noninformative, if it maximizes
the predictive gain in information on $\theta$ when a random sample of size $n$ is observed. He
shows then that, in large samples, if the family $\mathcal{F}$ satisfies the Cramér-Rao regularity
conditions, and the maximum likelihood estimator (MLE) $\hat\theta_n$ is minimal sufficient
for $\mathcal{F}$, then the noninformative prior density is proportional to $|I(\theta)|^{1/2}$, where $|I(\theta)|$
is the determinant of the Fisher information matrix. As will be shown in Example
8.4, $h(\theta) \propto |I(\theta)|^{1/2}$ is sometimes a proper p.d.f. and sometimes an improper one.

Jeffreys (1961) justified the use of the noninformative prior $|I(\theta)|^{1/2}$ on the
basis of invariance. He argued that if a statistical model $\mathcal{F} = \{f(x; \theta); \theta \in \Theta\}$ is
reparametrized to $\mathcal{F}^* = \{f^*(x; \omega); \omega \in \Omega\}$, where $\omega = \phi(\theta)$, then the prior density
$h(\theta)$ should be chosen so that $h(\theta \mid X) = h(\omega \mid X)$.

Let $\theta = \phi^{-1}(\omega)$ and let $J(\omega)$ be the Jacobian of the transformation, then the
posterior p.d.f. of $\omega$ is

$$h^*(\omega \mid X) \propto h(\phi^{-1}(\omega)) f(x; \phi^{-1}(\omega)) |J(\omega)|. \qquad (8.1.6)$$

Recall that the Fisher information matrix of $\omega$ is

$$I^*(\omega) = J^T(\omega) I(\phi^{-1}(\omega)) J(\omega). \qquad (8.1.7)$$

Thus, if $h(\theta) \propto |I(\theta)|^{1/2}$ then from (8.1.6) and (8.1.8), since

$$|I^*(\omega)|^{1/2} = |I(\phi^{-1}(\omega))|^{1/2} |J(\omega)|, \qquad (8.1.8)$$

we obtain

$$h^*(\omega \mid X) \propto f(x; \phi^{-1}(\omega)) |I^*(\omega)|^{1/2}. \qquad (8.1.9)$$

The structure of $h(\theta \mid X)$ and of $h^*(\omega \mid X)$ is similar. This is the "invariance" property
of the posterior, with respect to transformations of the parameter.
A prior density proportional to $|I(\theta)|^{1/2}$ is called a Jeffreys prior density.
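
The invariance relation (8.1.8) can be checked on a one-parameter case. The sketch below assumes a single Bernoulli($\theta$) observation and the logit reparametrization $\omega = \log(\theta/(1 - \theta))$; both the model and the function names are illustrative assumptions rather than part of the text.

```python
import math


def fisher_info_theta(theta: float) -> float:
    """Fisher information of one Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))


def theta_of_omega(omega: float) -> float:
    """Inverse of the logit map omega = log(theta / (1 - theta))."""
    return 1.0 / (1.0 + math.exp(-omega))


def jacobian(omega: float) -> float:
    """d theta / d omega for the logit reparametrization."""
    theta = theta_of_omega(omega)
    return theta * (1.0 - theta)


def jeffreys_omega_direct(omega: float) -> float:
    """sqrt of I*(omega) = J(omega) I(theta(omega)) J(omega), as in (8.1.7)."""
    j = jacobian(omega)
    return math.sqrt(j * fisher_info_theta(theta_of_omega(omega)) * j)


def jeffreys_omega_transformed(omega: float) -> float:
    """Theta-scale Jeffreys density carried to omega: |I(theta)|^(1/2) |J(omega)|."""
    theta = theta_of_omega(omega)
    return math.sqrt(fisher_info_theta(theta)) * abs(jacobian(omega))


for omega in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(omega, jeffreys_omega_direct(omega), jeffreys_omega_transformed(omega))
```

The two columns coincide: deriving the Jeffreys density directly in the $\omega$ parametrization gives the same function as transforming the $\theta$-scale density by the Jacobian.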


8.1.3. Risk Functions and Bayes Procedures 


In statistical decision theory, we consider the problems of inference in terms of
a specified set of actions, $\mathcal{A}$, and their outcomes. The outcome of the decision is
expressed in terms of some utility function, which provides numerical quantities
associated with actions of $\mathcal{A}$ and the given parameters, $\theta$, characterizing the elements
of the family $\mathcal{F}$ specified by the model. Instead of discussing utility functions, we
discuss here loss functions, $L(a, \theta)$, $a \in \mathcal{A}$, $\theta \in \Theta$, associated with actions and
parameters. The loss functions are nonnegative functions that assume the value zero
if the action chosen does not imply some utility loss when $\theta$ is the true state of Nature.
One of the important questions is what type of loss function to consider. The answer
to this question depends on the decision problem and on the structure of the model.
In the classical approach to testing hypotheses, the loss function assumes the value
zero if no error is committed and the value one if an error of either kind is made. In a
decision theoretic approach, testing hypotheses can be performed with more general
loss functions, as will be shown in Section 8.2. In estimation theory, the squared-error
loss function $(\hat\theta(x) - \theta)^2$ is frequently applied, when $\hat\theta(x)$ is an estimator of $\theta$. A
generalization of this type of loss function, which is of theoretical importance, is the
general class of quadratic loss functions, given by

$$L(\hat\theta(x), \theta) = Q(\theta)(\hat\theta(x) - \theta)^2, \qquad (8.1.10)$$

where $Q(\theta) > 0$ is an appropriate function of $\theta$. For example, $(\hat\theta(x) - \theta)^2/\theta^2$ is a
quadratic loss function. Another type of loss function used in estimation theory is
the type of function that depends on $\hat\theta(x)$ and $\theta$ only through the absolute value
of their difference. That is, $L(\hat\theta(x), \theta) = W(|\hat\theta(x) - \theta|)$. For example, $|\hat\theta(x) - \theta|^\nu$
where $\nu > 0$, or $\log(1 + |\hat\theta(x) - \theta|)$. Bilinear convex functions of the form

$$L(\hat\theta, \theta) = a_1(\hat\theta - \theta)^- + a_2(\hat\theta - \theta)^+ \qquad (8.1.11)$$

are also in use, where $a_1, a_2$ are positive constants; $(\hat\theta - \theta)^- = -\min(\hat\theta - \theta, 0)$ and
$(\hat\theta - \theta)^+ = \max(\hat\theta - \theta, 0)$. If the value of $\theta$ is known one can always choose a proper
action to insure no loss. The essence of statistical decision problems is that the true
parameter $\theta$ is unknown and decisions are made under uncertainty. The random
vector $\mathbf{X} = (X_1, \ldots, X_n)$ provides information about the unknown value of $\theta$. A
function from the sample space $\mathcal{X}$ of $\mathbf{X}$ into the action space $\mathcal{A}$ is called a decision
function. We denote it by $d(\mathbf{X})$ and require that it should be a statistic. Let $\mathcal{D}$ denote
a specified set or class of proper decision functions. Using a decision function $d(\mathbf{X})$
the associated loss $L(d(\mathbf{X}), \theta)$ is a random variable, for each $\theta$. The expected loss
under $\theta$, associated with a decision function $d(\mathbf{X})$, is called the risk function and
is denoted by $R(d, \theta) = E_\theta\{L(d(\mathbf{X}), \theta)\}$. Given the structure of a statistical decision
problem, the objective is to select an optimal decision function from $\mathcal{D}$. Ideally, we
would like to choose a decision function $d^0(\mathbf{X})$ that minimizes the associated risk
function $R(d, \theta)$ uniformly in $\theta$. Such a uniformly optimal decision function may not
exist, since the function $d^0$ for which $R(d^0, \theta) = \inf_{d \in \mathcal{D}} R(d, \theta)$ generally depends on
the particular value of $\theta$ under consideration. There are several ways to overcome this
difficulty. One approach is to restrict attention to a subclass of decision functions, like
unbiased or invariant decision functions. Another approach for determining optimal
decision functions is the Bayesian approach. We define here the notion of Bayes
decision function in a general context.

Consider a specified prior distribution, $H(\theta)$, defined over the parameter space
$\Theta$. With respect to this prior distribution, we define the prior risk, $\rho(d, H)$, as the
expected risk value when $\theta$ varies over $\Theta$, i.e.,

$$\rho(d, H) = \int_\Theta R(d, \theta) h(\theta)\,d\theta, \qquad (8.1.12)$$

where $h(\theta)$ is the corresponding p.d.f. A Bayes decision function, with respect to
a prior distribution $H$, is a decision function $d_H(x)$ that minimizes the prior risk
$\rho(d, H)$, i.e.,

$$\rho(d_H, H) = \inf_{d \in \mathcal{D}} \rho(d, H). \qquad (8.1.13)$$

Under some general conditions, a Bayes decision function $d_H(x)$ exists. The Bayes
decision function can be generally determined by minimizing the posterior expectation
of the loss function for a given value $x$ of the random variable $X$. Indeed, since
$L(d, \theta) \ge 0$ one can interchange the integration operations below and write

$$\rho(d, H) = \int_\Theta \left\{\int_{\mathcal{X}} L(d(x), \theta) f(x; \theta)\,dx\right\} h(\theta)\,d\theta
= \int_{\mathcal{X}} f_H(x) \left\{\int_\Theta L(d(x), \theta) \frac{f(x; \theta) h(\theta)}{f_H(x)}\,d\theta\right\} dx, \qquad (8.1.14)$$

where $f_H(x) = \int_\Theta f(x; \tau) h(\tau)\,d\tau$ is the predictive p.d.f. The conditional p.d.f.
$h(\theta \mid x) = f(x; \theta) h(\theta)/f_H(x)$ is the posterior p.d.f. of $\theta$, given $X = x$. Similarly,
the conditional expectation

$$R(d(x), H) = \int_\Theta L(d(x), \theta) h(\theta \mid x)\,d\theta \qquad (8.1.15)$$

is called the posterior risk of $d(x)$ under $H$. Thus, for a given $X = x$, we can
choose $d(x)$ to minimize $R(d(x), H)$. Since $L(d(x), \theta) \ge 0$ for all $\theta \in \Theta$ and $d \in \mathcal{D}$,
the minimization of the posterior risk minimizes also the prior risk $\rho(d, H)$. Thus,
$d_H(X)$ is a Bayes decision function.
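
Minimizing the posterior risk (8.1.15) can be sketched on a grid. The example below assumes squared-error loss and a posterior proportional to $\theta^3(1 - \theta)^2$ on $(0, 1)$, that is, a Beta(4, 3) posterior; the grid size and the names are illustrative assumptions.

```python
GRID_SIZE = 400
grid = [(i + 0.5) / GRID_SIZE for i in range(GRID_SIZE)]  # candidate theta values and actions

# Assumed posterior, proportional to theta^3 (1 - theta)^2, normalized on the grid.
raw_weights = [t ** 3 * (1.0 - t) ** 2 for t in grid]
total = sum(raw_weights)
posterior = [w / total for w in raw_weights]


def posterior_risk(action: float) -> float:
    """Posterior expectation of the squared-error loss (action - theta)^2."""
    return sum(w * (action - t) ** 2 for t, w in zip(grid, posterior))


# Bayes action: the grid point with the smallest posterior risk.
best_action = min(grid, key=posterior_risk)
posterior_mean = sum(t * w for t, w in zip(grid, posterior))

print(best_action, posterior_mean)
```

Under squared-error loss the minimizer agrees with the posterior mean up to the grid resolution, which is the familiar form of the Bayes estimator for this loss.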


8.2. BAYESIAN TESTING OF HYPOTHESIS 


8.2.1 Testing Simple Hypothesis 


We start with the problem of testing two simple hypotheses $H_0$ and $H_1$. Let $F_0(x)$
and $F_1(x)$ be two specified distribution functions. The hypothesis $H_0$ specifies the
parent distribution of $X$ as $F_0(x)$, $H_1$ specifies it as $F_1(x)$. Let $f_0(x)$ and $f_1(x)$ be
the p.d.f.s corresponding to $F_0(x)$ and $F_1(x)$, respectively. Let $\pi$, $0 \le \pi \le 1$, be the
prior probability that $H_0$ is true. In the special case of two simple hypotheses, the loss
function can assign 1 unit to the case of rejecting $H_0$ when it is true and $b$ units to
the case of rejecting $H_1$ when it is true. The prior risks associated with accepting $H_0$
and $H_1$ are, respectively, $\rho_0(\pi) = (1 - \pi)b$ and $\rho_1(\pi) = \pi$. For a given value of $\pi$,
we accept hypothesis $H_i$ ($i = 0, 1$) if $\rho_i(\pi)$ is the minimal prior risk. Thus, a Bayes
rule, prior to making observations, is

$$d_\pi = \begin{cases} 0, & \text{if } \pi \ge b/(1 + b), \\ 1, & \text{otherwise}, \end{cases} \qquad (8.2.1)$$

where $d_\pi = i$ is the decision to accept $H_i$ ($i = 0, 1$).

Suppose that a sample of $n$ i.i.d. random variables $X_1, \ldots, X_n$ has been observed.
After observing the sample, we determine the posterior probability $\pi(\mathbf{X}_n)$ that $H_0$ is
true. This posterior probability is given by

$$\pi(\mathbf{X}_n) = \pi \prod_{j=1}^n f_0(X_j) \Big/ \left[\pi \prod_{j=1}^n f_0(X_j) + (1 - \pi)\prod_{j=1}^n f_1(X_j)\right]. \qquad (8.2.2)$$

We use the decision rule (8.2.1) with $\pi$ replaced by $\pi(\mathbf{X}_n)$. Thus, the Bayes decision
function is

$$d_\pi(\mathbf{X}_n) = \begin{cases} 0, & \text{if } \pi(\mathbf{X}_n) \ge \dfrac{b}{1 + b}, \\ 1, & \text{otherwise}. \end{cases} \qquad (8.2.3)$$

The Bayes decision function can be written in terms of the test function discussed in
Chapter 4 as

$$\phi_\pi(\mathbf{X}_n) = \begin{cases} 1, & \text{if } \prod_{j=1}^n \dfrac{f_1(X_j)}{f_0(X_j)} > \dfrac{\pi}{b(1 - \pi)}, \\ 0, & \text{otherwise}. \end{cases} \qquad (8.2.4)$$

The Bayes test function $\phi_\pi(\mathbf{X}_n)$ is similar to the Neyman-Pearson most powerful
test, except that the Bayes test is not necessarily randomized even if the distributions
$F_i(x)$ are discrete. Moreover, the likelihood ratio $\prod_{j=1}^n f_1(X_j)/f_0(X_j)$ is compared to
the ratio of the prior risks.
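
The equivalence of the posterior-probability rule (8.2.3) and the likelihood-ratio form (8.2.4) can be illustrated with a short sketch. The choice of $f_0$ and $f_1$ as the $N(0, 1)$ and $N(1, 1)$ densities, the data values, and the function names are assumptions made for this illustration.

```python
import math


def normal_pdf(x: float, mu: float) -> float:
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def f0(x: float) -> float:
    return normal_pdf(x, 0.0)  # density under H0 (assumed)


def f1(x: float) -> float:
    return normal_pdf(x, 1.0)  # density under H1 (assumed)


def posterior_h0(xs, prior: float) -> float:
    """Posterior probability of H0, as in (8.2.2)."""
    like0 = math.prod(f0(x) for x in xs)
    like1 = math.prod(f1(x) for x in xs)
    return prior * like0 / (prior * like0 + (1.0 - prior) * like1)


def bayes_decision(xs, prior: float, b: float) -> int:
    """Rule (8.2.3): accept H0 (return 0) if the posterior is at least b / (1 + b)."""
    return 0 if posterior_h0(xs, prior) >= b / (1.0 + b) else 1


def bayes_test(xs, prior: float, b: float) -> int:
    """Rule (8.2.4): reject H0 (return 1) if the likelihood ratio exceeds pi / (b (1 - pi))."""
    likelihood_ratio = math.prod(f1(x) / f0(x) for x in xs)
    return 1 if likelihood_ratio > prior / (b * (1.0 - prior)) else 0


data_high = [0.2, 1.1, 0.7, 1.4]  # log likelihood ratio = 1.4
data_low = [-0.3, 0.1]            # log likelihood ratio = -1.2
for data in (data_high, data_low):
    print(posterior_h0(data, 0.5), bayes_decision(data, 0.5, 1.0), bayes_test(data, 0.5, 1.0))
```

For these densities the log likelihood ratio is $\sum_j (X_j - 0.5)$, so the first sample leads to rejecting $H_0$ and the second to accepting it under both forms of the rule.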
We discuss now some of the important optimality characteristics of Bayes tests
of two simple hypotheses. Let $R_0(\phi)$ and $R_1(\phi)$ denote the risks associated with an
arbitrary test statistic $\phi$, when $H_0$ or $H_1$ are true, respectively. Let $R_0(\pi)$ and $R_1(\pi)$
denote the corresponding risk values of a Bayes test function, with respect to a prior
probability $\pi$. Generally

$$R_0(\phi) = c_1\epsilon_0(\phi); \quad 0 < c_1 < \infty$$

and

$$R_1(\phi) = c_2\epsilon_1(\phi); \quad 0 < c_2 < \infty,$$

where $\epsilon_0(\phi)$ and $\epsilon_1(\phi)$ are the error probabilities of the test statistic $\phi$, $c_1$ and $c_2$
are costs of erroneous decisions. The set $R = \{(R_0(\phi), R_1(\phi)); \text{all test functions } \phi\}$
is called the risk set. Since for every $0 \le \alpha \le 1$ and any functions $\phi^{(1)}$ and $\phi^{(2)}$,
$\alpha\phi^{(1)} + (1 - \alpha)\phi^{(2)}$ is also a test function, and since

$$R_i(\alpha\phi^{(1)} + (1 - \alpha)\phi^{(2)}) = \alpha R_i(\phi^{(1)}) + (1 - \alpha) R_i(\phi^{(2)}), \quad i = 0, 1, \qquad (8.2.5)$$

the risk set $R$ is convex. Moreover, the set

$$S = \{(R_0(\pi), R_1(\pi)); 0 \le \pi \le 1\} \qquad (8.2.6)$$

of all risk points corresponding to the Bayes tests is the lower boundary for $R$. Indeed,
according to (8.2.4) and the Neyman-Pearson Lemma, $R_1(\pi)$ is the smallest possible
risk of all test functions $\phi$ with $R_0(\phi) = R_0(\pi)$. Accordingly, all the Bayes tests
constitute a complete class in the sense that, for any test function outside the class,
there exists a corresponding Bayes test with a risk point having components smaller
or equal to those of that particular test and at least one component strictly smaller
(Ferguson, 1967, Ch. 2). From the decision theoretic point of view there is no sense in
considering test functions that do not belong to the complete class. These results can
be generalized to the case of testing $k$ simple hypotheses (Blackwell and Girshick,
1954; Ferguson, 1967).


8.2.2 Testing Composite Hypotheses 


Let $\Theta_0$ and $\Theta_1$ be the sets of $\theta$-points corresponding to the (composite) hypotheses
$H_0$ and $H_1$, respectively. These sets contain a finite or infinite number of points. Let
$H(\theta)$ be a prior distribution function specified over $\Theta = \Theta_0 \cup \Theta_1$. The posterior
probability of $H_0$, given $n$ i.i.d. random variables $X_1, \ldots, X_n$, is

$$\pi(\mathbf{X}_n) = \frac{\int_{\Theta_0} \prod_{i=1}^{n} f(X_i; \theta)\, dH(\theta)}{\int_{\Theta} \prod_{i=1}^{n} f(X_i; \theta)\, dH(\theta)}, \tag{8.2.7}$$

where $f(x; \theta)$ is the p.d.f. of $X$ under $\theta$. The notation in (8.2.7) signifies that if the sets
are discrete the corresponding integrals are sums and $dH(\theta)$ are prior probabilities,
otherwise $dH(\theta) = h(\theta)d\theta$, where $h(\theta)$ is a p.d.f. The Bayes decision rule is obtained
by computing the posterior risk associated with accepting $H_0$ or with accepting $H_1$
and making the decision associated with the minimal posterior risk. The form of the
Bayes test depends, therefore, on the loss function employed.

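If the loss functions associated with accepting $H_0$ or $H_1$ are

$$L_0(\theta) = c_0 I\{\theta \in \Theta_1\} \quad \text{and} \quad L_1(\theta) = c_1 I\{\theta \in \Theta_0\},$$

then the associated posterior risk functions are

$$R_0(\mathbf{X}) = c_0 \int_{\Theta_1} f(\mathbf{X}; \theta)\, dH(\theta) \Big/ \int_{\Theta} f(\mathbf{X}; \theta)\, dH(\theta) \tag{8.2.8}$$

and

$$R_1(\mathbf{X}) = c_1 \int_{\Theta_0} f(\mathbf{X}; \theta)\, dH(\theta) \Big/ \int_{\Theta} f(\mathbf{X}; \theta)\, dH(\theta).$$

In this case, the Bayes test function is

$$\phi_H(\mathbf{X}) = \begin{cases} 1, & \text{if } c_1 \int_{\Theta_0} f(\mathbf{X}; \theta)\, dH(\theta) < c_0 \int_{\Theta_1} f(\mathbf{X}; \theta)\, dH(\theta), \\ 0, & \text{otherwise.} \end{cases} \tag{8.2.9}$$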


In other words, the hypothesis $H_0$ is rejected if the predictive likelihood ratio

$$\Lambda_H(\mathbf{X}) = \int_{\Theta_1} f(\mathbf{X}; \theta)\, dH(\theta) \Big/ \int_{\Theta_0} f(\mathbf{X}; \theta)\, dH(\theta) \tag{8.2.10}$$

is greater than the loss ratio $c_1/c_0$. This can be considered as a generalization of
(8.2.4). The predictive likelihood ratio $\Lambda_H(\mathbf{X})$ is also called the Bayes factor in
favor of $H_1$ against $H_0$ (Good, 1965, 1967).

Cornfield (1969) suggested as a test function the ratio of the posterior odds in
favor of $H_0$, i.e., $P[H_0 \mid \mathbf{X}]/(1 - P[H_0 \mid \mathbf{X}])$, to the prior odds $\pi/(1 - \pi)$, where
$\pi = P[H_0]$ is the prior probability of $H_0$. The rule is to reject $H_0$ when this ratio is smaller
than a suitable constant. Cornfield called this statistic the relative betting odds. Note
that this relative betting odds is $[\Lambda_H(\mathbf{X})\,\pi/(1 - \pi)]^{-1}$. We see that Cornfield's test
function is equivalent to (8.2.9) for suitably chosen cost factors.

Karlin (1956) and Karlin and Rubin (1956) proved that in monotone likelihood
ratio families the Bayes test function is monotone in the sufficient statistic $T(\mathbf{X})$. For
testing $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$, the Bayes procedure rejects $H_0$ whenever
$T(\mathbf{X}) \ge \xi_0$. The result can be further generalized to the problem of testing multiple
hypotheses (Zacks, 1971; Ch. 10).

The problem of testing the composite hypothesis that all the probabilities in a
multinomial distribution have the same value has drawn considerable attention in the
statistical literature; see in particular the papers of Good (1967), Good and Crook
(1974), and Good (1975). The Bayes test procedure proposed by Good (1967) is based
on the symmetric Dirichlet prior distribution. More specifically, if $\mathbf{X} = (X_1, \ldots, X_k)'$
is a random vector having the multinomial distribution $M(n, \boldsymbol{\theta})$ then the parameter
vector $\boldsymbol{\theta}$ is ascribed the prior distribution with p.d.f.

$$h(\theta_1, \ldots, \theta_k) = \frac{\Gamma(k\nu)}{\Gamma^k(\nu)} \prod_{i=1}^{k} \theta_i^{\nu - 1}, \tag{8.2.11}$$

$0 < \theta_1, \ldots, \theta_k < 1$ and $\sum_{i=1}^{k} \theta_i = 1$. The Bayes factor for testing $H_0: \boldsymbol{\theta} = \frac{1}{k}\mathbf{1}$ against
the composite alternative hypothesis $H_1: \boldsymbol{\theta} \ne \frac{1}{k}\mathbf{1}$, where $\mathbf{1} = (1, \ldots, 1)'$, according
to (8.2.10) is

$$\Lambda(\nu; \mathbf{X}) = \frac{k^n\, \Gamma(k\nu) \prod_{i=1}^{k} \Gamma(\nu + X_i)}{\Gamma^k(\nu)\, \Gamma(\nu k + n)}. \tag{8.2.12}$$

From the purely Bayesian point of view, the statistician should be able to choose an
appropriate value of $\nu$ and some cost ratio $c_1/c_0$ for erroneous decisions, according
to subjective judgment, and reject $H_0$ if $\Lambda(\nu; \mathbf{X}) \ge c_1/c_0$. In practice, it is generally
not so simple to judge what are the appropriate values of $\nu$ and $c_1/c_0$. Good and
Crook (1974) suggested two alternative ways to solve this problem. One suggestion
is to consider an integrated Bayes factor

$$\Lambda(\mathbf{X}) = \int_0^\infty \phi(\nu) \Lambda(\nu; \mathbf{X})\, d\nu, \tag{8.2.13}$$

where $\phi(\nu)$ is the p.d.f. of a log-Cauchy distribution, i.e.,

$$\phi(\nu) = \frac{1}{\pi\nu} \cdot \frac{1}{1 + (\log \nu)^2}, \quad 0 < \nu < \infty. \tag{8.2.14}$$

The second suggestion is to find the value $\nu^0$ for which $\Lambda(\nu; \mathbf{X})$ is maximized and
reject $H_0$ if $\Lambda^* = (2 \log \Lambda(\nu^0; \mathbf{X}))^{1/2}$ exceeds the $(1 - \alpha)$-quantile of the asymptotic
distribution of $\Lambda^*$ under $H_0$. We see that non-Bayesian (frequentist) considerations
are introduced in order to arrive at an appropriate critical level for $\Lambda^*$. Good and
Crook call this approach a "Bayes/Non-Bayes compromise." We have presented this
problem and the approaches suggested for its solution to show that in practical work
a nondogmatic approach is needed. It may be reasonable to derive a test statistic in a
Bayesian framework and apply it in a non-Bayesian manner.
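
The Bayes factor (8.2.12) is conveniently evaluated on the log scale. The following is a minimal sketch; the function names, the example counts, and the grid of $\nu$ values are hypothetical choices of ours, and the grid search is only a crude stand-in for the maximization over $\nu$ in the second suggestion of Good and Crook.

```python
import math


def log_bayes_factor(counts, nu):
    """Log of the Bayes factor (8.2.12) for multinomial counts and a constant nu > 0."""
    k = len(counts)
    n = sum(counts)
    return (n * math.log(k)
            + math.lgamma(k * nu) - k * math.lgamma(nu)
            + sum(math.lgamma(nu + x) for x in counts)
            - math.lgamma(k * nu + n))


def max_log_bayes_factor(counts, nu_grid):
    """Maximizer over a user-supplied grid of nu values, and the maximal log Bayes factor."""
    best_nu = max(nu_grid, key=lambda nu: log_bayes_factor(counts, nu))
    return best_nu, log_bayes_factor(counts, best_nu)


counts = [18, 7, 5]                         # hypothetical data, k = 3 cells
nu_grid = [0.1 * i for i in range(1, 101)]  # nu = 0.1, 0.2, ..., 10.0
print(log_bayes_factor(counts, 1.0), max_log_bayes_factor(counts, nu_grid))
```

Using `math.lgamma` instead of the gamma function itself avoids overflow for moderately large n.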


8.2.3 Bayes Sequential Testing of Hypotheses 


We consider in the present section an application of the general theory of Section 8.1.5
to the case of testing two simple hypotheses. We have seen in Section 8.2.1 that the
Bayes decision test function, after observing $\mathbf{X}_n$, is to reject $H_0$ if the posterior
probability, $\pi(\mathbf{X}_n)$, that $H_0$ is true is less than or equal to a constant $\pi^*$. The associated
Bayes risk is $\rho^{(0)}(\pi(\mathbf{X}_n)) = \pi(\mathbf{X}_n) I\{\pi(\mathbf{X}_n) \le \pi^*\} + b(1 - \pi(\mathbf{X}_n)) I\{\pi(\mathbf{X}_n) > \pi^*\}$, where
$\pi^* = b/(1 + b)$. If $\pi(\mathbf{X}_n) = \pi$ then the posterior probability of $H_0$ after the $(n + 1)$st
observation is $\psi(\pi, X_{n+1}) = \left(1 + \frac{1 - \pi}{\pi} R(X_{n+1})\right)^{-1}$, where $R(x) = \frac{f_1(x)}{f_0(x)}$ is the
likelihood ratio. The predictive risk associated with an additional observation is

$$\bar{\rho}_1(\pi) = c + E\{\rho^{(0)}(\psi(\pi, X))\}, \tag{8.2.15}$$

where $c$ is the cost of one observation, and the expectation is with respect to the
predictive distribution of $X$ given $\pi$. We can show that the function $\bar{\rho}_1(\pi)$ is concave
on $[0, 1]$ and thus continuous on $(0, 1)$. Moreover, $\bar{\rho}_1(0) \ge c$ and $\bar{\rho}_1(1) \ge c$. Note
that the function $\psi(\pi, X) \to 0$ w.p.1 if $\pi \to 0$ and $\psi(\pi, X) \to 1$ w.p.1 if $\pi \to 1$.
Since $\rho^{(0)}(\pi)$ is bounded by $\pi^*$, we obtain by the Lebesgue Dominated Convergence
Theorem that $E\{\rho^{(0)}(\psi(\pi, X))\} \to 0$ as $\pi \to 0$ or as $\pi \to 1$. The Bayes risk associated
with an additional observation is

$$\rho^{(1)}(\pi) = \min\{\rho^{(0)}(\pi), \bar{\rho}_1(\pi)\}. \tag{8.2.16}$$


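Thus, if $c \ge b/(1 + b)$ it is not optimal to make any observation. On the other hand, if
$c < b/(1 + b)$ there exist two points $\pi_1^{(1)}$ and $\pi_2^{(1)}$ such that $0 < \pi_1^{(1)} < \pi^* < \pi_2^{(1)} < 1$, and

$$\rho^{(1)}(\pi) = \begin{cases} \rho^{(0)}(\pi), & \text{if } \pi \le \pi_1^{(1)} \text{ or } \pi \ge \pi_2^{(1)}, \\ \bar{\rho}_1(\pi), & \text{otherwise.} \end{cases} \tag{8.2.17}$$

Let

$$\bar{\rho}_2(\pi) = c + E\{\rho^{(1)}(\psi(\pi, X))\}, \quad 0 \le \pi \le 1, \tag{8.2.18}$$

and let

$$\rho^{(2)}(\pi) = \min\{\rho^{(0)}(\pi), \bar{\rho}_2(\pi)\}, \quad 0 \le \pi \le 1. \tag{8.2.19}$$

Since $\rho^{(1)}(\psi(\pi, X)) \le \rho^{(0)}(\psi(\pi, X))$ for each $\pi$ with probability one, we obtain that
$\bar{\rho}_2(\pi) \le \bar{\rho}_1(\pi)$ for all $0 \le \pi \le 1$. Thus, $\rho^{(2)}(\pi) \le \rho^{(1)}(\pi)$ for all $\pi$, $0 \le \pi \le 1$. $\bar{\rho}_2(\pi)$
is also a concave function of $\pi$ on $[0, 1]$ and $\bar{\rho}_2(0) = \bar{\rho}_2(1) = c$. Thus, there exist
$\pi_1^{(2)} \le \pi_1^{(1)}$ and $\pi_2^{(2)} \ge \pi_2^{(1)}$ such that

$$\rho^{(2)}(\pi) = \begin{cases} \rho^{(0)}(\pi), & \text{if } \pi \le \pi_1^{(2)} \text{ or } \pi \ge \pi_2^{(2)}, \\ \bar{\rho}_2(\pi), & \text{otherwise.} \end{cases} \tag{8.2.20}$$

We define now recursively, for each $\pi$ on $[0, 1]$,

$$\bar{\rho}_n(\pi) = c + E\{\rho^{(n-1)}(\psi(\pi, X))\}, \quad n \ge 1, \tag{8.2.21}$$

and

$$\rho^{(n)}(\pi) = \min\{\rho^{(0)}(\pi), \bar{\rho}_n(\pi)\}. \tag{8.2.22}$$

These functions constitute for each $\pi$ monotone sequences $\bar{\rho}_n(\pi) \le \bar{\rho}_{n-1}(\pi)$ and
$\rho^{(n)}(\pi) \le \rho^{(n-1)}(\pi)$ for every $n > 1$. Moreover, for each $n$ there exist
$0 < \pi_1^{(n)} \le \pi_1^{(n-1)} < \pi_2^{(n-1)} \le \pi_2^{(n)} < 1$ such that

$$\rho^{(n)}(\pi) = \begin{cases} \rho^{(0)}(\pi), & \text{if } \pi \le \pi_1^{(n)} \text{ or } \pi \ge \pi_2^{(n)}, \\ \bar{\rho}_n(\pi), & \text{otherwise.} \end{cases} \tag{8.2.23}$$

Let $\rho(\pi) = \lim_{n \to \infty} \rho^{(n)}(\pi)$ for each $\pi$ in $[0, 1]$ and $\bar{\rho}(\pi) = c + E\{\rho(\psi(\pi, X))\}$. By the
Lebesgue Monotone Convergence Theorem, we prove that $\bar{\rho}(\pi) = \lim_{n \to \infty} \bar{\rho}_n(\pi)$ for
each $\pi \in [0, 1]$. The boundary points $\pi_1^{(n)}$ and $\pi_2^{(n)}$ converge to $\pi_1$ and $\pi_2$, respectively,
where $0 < \pi_1 < \pi_2 < 1$. Consider now a nontruncated Bayes sequential procedure,
with the stopping variable

$$N = \min\{n \ge 0 : \rho^{(0)}(\pi(\mathbf{X}_n)) = \rho(\pi(\mathbf{X}_n))\}, \tag{8.2.24}$$

where $\mathbf{X}_0 \equiv 0$ and $\pi(\mathbf{X}_0) \equiv \pi$. Since under $H_0$, $\pi(\mathbf{X}_n) \to 1$ with probability one and
under $H_1$, $\pi(\mathbf{X}_n) \to 0$ with probability one, the stopping variable (8.2.24) is finite with
probability one.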
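
The recursion (8.2.21)-(8.2.22) can be carried out numerically on a grid of $\pi$ values. The sketch below is an illustration under assumptions of our own: the observations are Bernoulli with success probability `p0` under $H_0$ and `p1` under $H_1$, the grid size and linear interpolation are arbitrary numerical choices, and the function names are hypothetical.

```python
def truncated_bayes_risk(p0, p1, b, c, n_stages, grid_size=401):
    """Backward induction (8.2.21)-(8.2.22) for Bernoulli observations.

    p0 and p1 are P(X = 1) under H0 and H1; returns (grid, rho0, rho_n).
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Stopping risk rho^(0)(pi) = pi if pi <= pi*, b(1 - pi) otherwise = min{pi, b(1 - pi)}
    rho0 = [min(p, b * (1.0 - p)) for p in grid]

    def interp(values, p):
        # Linear interpolation of a function tabulated on the uniform grid
        t = p * (grid_size - 1)
        i = min(int(t), grid_size - 2)
        return values[i] + (t - i) * (values[i + 1] - values[i])

    rho = rho0[:]
    for _ in range(n_stages):
        updated = []
        for stop_risk, p in zip(rho0, grid):
            cont = c  # cost of one more observation
            for f0, f1 in ((p0, p1), (1.0 - p0, 1.0 - p1)):  # x = 1, then x = 0
                pred = p * f0 + (1.0 - p) * f1               # predictive P(X = x)
                if pred > 0.0:
                    cont += pred * interp(rho, p * f0 / pred)  # rho at psi(pi, x)
            updated.append(min(stop_risk, cont))
        rho = updated
    return grid, rho0, rho


def continuation_interval(grid, rho0, rho, tol=1e-12):
    """Smallest and largest grid points at which continuing beats stopping."""
    inside = [p for p, r0, r in zip(grid, rho0, rho) if r < r0 - tol]
    return (min(inside), max(inside)) if inside else None


grid, rho0, rho = truncated_bayes_risk(0.3, 0.7, b=1.0, c=0.01, n_stages=30)
print(continuation_interval(grid, rho0, rho))
```

The endpoints returned by `continuation_interval` are grid approximations to the boundary points $\pi_1^{(n)}$ and $\pi_2^{(n)}$ of the truncated procedure.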

It is generally very difficult to determine the exact Bayes risk function $\rho(\pi)$ and the
exact boundary points $\pi_1$ and $\pi_2$. One can prove, however, that the Wald sequential
probability ratio test (SPRT) (see Section 4.8.1) is a Bayes sequential procedure in
the class of all stopping variables for which $N \ge 1$, corresponding to some prior
probability $\pi$ and cost parameter $b$. For a proof of this result, see Ghosh (1970, p. 93)
or Zacks (1971, p. 456). A large sample approximation to the risk function $\rho(\pi)$
was given by Chernoff (1959). Chernoff has shown that in the SPRT given by the
boundaries $(A, B)$, if $A \to -\infty$ and $B \to \infty$, we have

$$A \approx \log c - \log \frac{I(0, 1)\, b\,(1 - \pi)}{\pi}, \qquad B \approx \log \frac{1}{c} + \log \frac{I(1, 0)\, \pi}{1 - \pi}, \tag{8.2.25}$$

where the cost of observations $c \to 0$ and $I(0, 1)$, $I(1, 0)$ are the Kullback-Leibler
information numbers. Moreover, as $c \to 0$,

$$\rho(\pi) \approx (-c \log c) \left( \frac{\pi}{I(0, 1)} + \frac{1 - \pi}{I(1, 0)} \right). \tag{8.2.26}$$

Shiryayev (1973, p. 127) derived an expression for the Bayes risk $\rho(\pi)$ associated
with a continuous version of the Bayes sequential procedure related to a Wiener
process. Reduction of the testing problem for the mean of a normal distribution to
a free boundary problem related to the Wiener process was done also by Chernoff
(1961, 1965, 1968); see also the book of Dynkin and Yushkevich (1969).

A simpler sequential stopping rule for testing two simple hypotheses is

$$N_\epsilon = \min\{n \ge 1 : \pi(\mathbf{X}_n) \le \epsilon \text{ or } \pi(\mathbf{X}_n) \ge 1 - \epsilon\}. \tag{8.2.27}$$




If $\pi(\mathbf{X}_N) \le \epsilon$ then $H_0$ is rejected, and if $\pi(\mathbf{X}_N) \ge 1 - \epsilon$ then $H_0$ is accepted. This
stopping rule is equivalent to a Wald SPRT $(A, B)$ with the limits

$$A = \frac{\epsilon \pi}{(1 - \epsilon)(1 - \pi)} \quad \text{and} \quad B = \frac{(1 - \epsilon)\pi}{\epsilon(1 - \pi)}.$$

If $\pi = \frac{1}{2}$ then, according to the results of Section 4.8.1, the average error probability
is less than or equal to $\epsilon$. This result can be extended to the problem of testing $k$
simple hypotheses ($k \ge 2$), as shown in the following.

Let $H_1, \ldots, H_k$ be $k$ hypotheses ($k \ge 2$) concerning the distribution of a random
variable (vector) $X$. According to $H_j$, the p.d.f. of $X$ is $f_j(x; \theta)$, $\theta \in \Theta_j$, $j = 1, \ldots, k$.
The parameter $\theta$ is a nuisance parameter, whose parameter space $\Theta_j$ may depend
on $H_j$. Let $G_j(\theta)$, $j = 1, \ldots, k$, be a prior distribution on $\Theta_j$, and let $\pi_j$ be the
prior probability that $H_j$ is the true hypothesis, $\sum_{j=1}^{k} \pi_j = 1$. Given $n$ observations
$X_1, \ldots, X_n$, which are assumed to be conditionally i.i.d., we compute the predictive
likelihood of $H_j$, namely,

$$L_j(\mathbf{X}_n) = \int_{\Theta_j} \prod_{i=1}^{n} f_j(X_i; \theta)\, dG_j(\theta), \tag{8.2.28}$$

$j = 1, \ldots, k$. Finally, the posterior probability of $H_j$, after $n$ observations, is

$$\pi_j(\mathbf{X}_n) = \frac{\pi_j L_j(\mathbf{X}_n)}{\sum_{i=1}^{k} \pi_i L_i(\mathbf{X}_n)}, \quad j = 1, \ldots, k. \tag{8.2.29}$$

We consider the following Bayesian stopping variable, for some $0 < \epsilon < 1$:

$$N_\epsilon = \min\{n, n \ge 1 : \max_{1 \le j \le k} \pi_j(\mathbf{X}_n) \ge 1 - \epsilon\}. \tag{8.2.30}$$

Obviously, one considers small values of $\epsilon$, $0 < \epsilon < 1/2$, and for such $\epsilon$, there is a
unique value $j^0$ such that $\pi_{j^0}(\mathbf{X}_{N_\epsilon}) \ge 1 - \epsilon$. At stopping, hypothesis $H_{j^0}$ is accepted.
For each $n \ge 1$, partition the sample space $\mathcal{X}^{(n)}$ of $\mathbf{X}_n$ into $(k + 1)$ disjoint sets

$$D_j^{(n)} = \{\mathbf{x}_n : \pi_j(\mathbf{x}_n) \ge 1 - \epsilon\}, \quad j = 1, \ldots, k,$$

and $D_0^{(n)} = \mathcal{X}^{(n)} - \bigcup_{j=1}^{k} D_j^{(n)}$. As long as $\mathbf{x}_n \in D_0^{(n)}$ we continue sampling. Thus,
$N_\epsilon = \min\{n \ge 1 : \mathbf{x}_n \in \bigcup_{j=1}^{k} D_j^{(n)}\}$. In this sequential testing procedure, decision errors
occur at stopping, when the wrong hypothesis is accepted. Thus, let $\delta_{ij}$ denote the
predictive probability of accepting $H_i$ when $H_j$ is the correct hypothesis. That is,

$$\delta_{ij} = \sum_{n=1}^{\infty} \int_{D_i^{(n)}} L_j(\mathbf{x}_n)\, d\mu(\mathbf{x}_n). \tag{8.2.31}$$

Note that, for $\pi^* = 1 - \epsilon$, $\pi_j(\mathbf{x}_n) \ge \pi^*$ if, and only if,

$$\sum_{i \ne j} \pi_i L_i(\mathbf{x}_n) \le \frac{1 - \pi^*}{\pi^*}\, \pi_j L_j(\mathbf{x}_n). \tag{8.2.32}$$

Let $\alpha_j$ denote the predictive error probability of rejecting $H_j$ when it is true, i.e.,
$\alpha_j = \sum_{i \ne j} \delta_{ij}$. The average predictive error probability is $\bar{\alpha}_\pi = \sum_{j=1}^{k} \pi_j \alpha_j$.


Theorem 8.2.1. For the stopping variable $N_\epsilon$, the average predictive error probability
is $\bar{\alpha}_\pi \le \epsilon$.


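Proof. From the inequality (8.2.32), we obtain, for $i \ne j$,

$$\begin{aligned}
\pi_j \delta_{ij} &= \sum_{n=1}^{\infty} \int_{D_i^{(n)}} \pi_j L_j(\mathbf{x}_n)\, d\mu(\mathbf{x}_n) \\
&\le \sum_{n=1}^{\infty} \int_{D_i^{(n)}} \Big[ \frac{1 - \pi^*}{\pi^*}\, \pi_i L_i(\mathbf{x}_n) - \sum_{l \ne i, j} \pi_l L_l(\mathbf{x}_n) \Big] d\mu(\mathbf{x}_n) \\
&= \frac{1 - \pi^*}{\pi^*}\, \pi_i (1 - \alpha_i) - \sum_{l \ne i, j} \pi_l \delta_{il}.
\end{aligned} \tag{8.2.33}$$

Summing over $i \ne j$, we get

$$\pi_j \sum_{i \ne j} \delta_{ij} \le \frac{1 - \pi^*}{\pi^*} \sum_{i \ne j} \pi_i (1 - \alpha_i) - \sum_{i \ne j} \sum_{l \ne i, j} \pi_l \delta_{il},$$

or

$$\pi_j \alpha_j \le \frac{1 - \pi^*}{\pi^*} \sum_{i \ne j} \pi_i (1 - \alpha_i) - \sum_{i \ne j} \sum_{l \ne i, j} \pi_l \delta_{il}.$$

Summing over $j$, we obtain

$$\bar{\alpha}_\pi \le \frac{1 - \pi^*}{\pi^*} \sum_{j} \sum_{i \ne j} \pi_i (1 - \alpha_i) - \sum_{j} \sum_{i \ne j} \sum_{l \ne i, j} \pi_l \delta_{il}. \tag{8.2.34}$$

The first term on the RHS of (8.2.34) is

$$\frac{1 - \pi^*}{\pi^*} \sum_{j} \sum_{i \ne j} \pi_i (1 - \alpha_i) = \frac{1 - \pi^*}{\pi^*} \sum_{j} \big( 1 - \bar{\alpha}_\pi - \pi_j (1 - \alpha_j) \big) = \frac{1 - \pi^*}{\pi^*}\, (k - 1)(1 - \bar{\alpha}_\pi). \tag{8.2.35}$$

The second term on the RHS of (8.2.34) is

$$-\sum_{j} \sum_{i \ne j} \sum_{l \ne i, j} \pi_l \delta_{il} = -\sum_{j} \sum_{l \ne j} \pi_l (\alpha_l - \delta_{jl}) = -(k - 1)\bar{\alpha}_\pi + \bar{\alpha}_\pi = -(k - 2)\bar{\alpha}_\pi. \tag{8.2.36}$$

Substitution of (8.2.35) and (8.2.36) into (8.2.34) yields

$$\bar{\alpha}_\pi \le \frac{1 - \pi^*}{\pi^*}\, (1 - \bar{\alpha}_\pi),$$

or

$$\bar{\alpha}_\pi \le \epsilon.$$

QED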
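
The bound of Theorem 8.2.1 can be checked by simulation. The sketch below makes simplifying assumptions of our own: the $k$ hypotheses are simple (no nuisance parameter), with $H_j$ stating that the observations are $N(\mu_j, 1)$, so that $L_j$ is the ordinary likelihood; the means, prior probabilities, seed, and function names are hypothetical.

```python
import math
import random


def posterior_probs(log_liks, priors):
    """Posterior probabilities (8.2.29) from log predictive likelihoods."""
    logs = [math.log(p) + ll for p, ll in zip(priors, log_liks)]
    m = max(logs)
    w = [math.exp(v - m) for v in logs]
    s = sum(w)
    return [v / s for v in w]


def run_once(means, priors, eps, rng, max_n=10_000):
    """Simulate one sequential test; return (true index, accepted index, sample size)."""
    k = len(means)
    true_j = rng.choices(range(k), weights=priors)[0]  # draw the true hypothesis from the prior
    log_liks = [0.0] * k
    for n in range(1, max_n + 1):
        x = rng.gauss(means[true_j], 1.0)
        for j in range(k):
            log_liks[j] += -0.5 * (x - means[j]) ** 2  # N(mu_j, 1) log density up to a constant
        post = posterior_probs(log_liks, priors)
        best = max(range(k), key=post.__getitem__)
        if post[best] >= 1.0 - eps:                    # stopping rule (8.2.30)
            return true_j, best, n
    return true_j, best, max_n  # safeguard only: forced stop at max_n


def average_error_rate(means, priors, eps, n_runs, seed=0):
    """Monte Carlo estimate of the average predictive error probability."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_runs):
        true_j, accepted, _n = run_once(means, priors, eps, rng)
        errors += (true_j != accepted)
    return errors / n_runs


rate = average_error_rate([0.0, 1.0, 2.0], [0.2, 0.3, 0.5], eps=0.05, n_runs=2000, seed=1)
print(rate)
```

Since the true hypothesis is drawn from the prior in each run, the observed error frequency estimates $\bar{\alpha}_\pi$ and should not exceed $\epsilon$ by more than simulation noise.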


Thus, the Bayes sequential procedure given by the stopping variable $N_\epsilon$ and the
associated decision rule can provide an excellent testing procedure when the number
of hypotheses $k$ is large. Rogatko and Zacks (1993) applied this procedure for testing
the correct gene order. In this problem, if one wishes to order $m$ gene loci on a
chromosome, the number of hypotheses to test is $k = m!/2$.


8.3 BAYESIAN CREDIBILITY AND PREDICTION INTERVALS


8.3.1 Credibility Intervals 


Let $\mathcal{F} = \{F(x; \theta); \theta \in \Theta\}$ be a parametric family of distribution functions. Let $H(\theta)$
be a specified prior distribution of $\theta$ and $H(\theta \mid \mathbf{X})$ be the corresponding posterior
distribution, given $\mathbf{X}$. If $\theta$ is real then an interval $(\underline{L}_\alpha(\mathbf{X}), \bar{L}_\alpha(\mathbf{X}))$ is called a Bayes
credibility interval of level $1 - \alpha$ if for all $\mathbf{X}$ (with probability 1)

$$P_H\{\underline{L}_\alpha(\mathbf{X}) \le \theta \le \bar{L}_\alpha(\mathbf{X}) \mid \mathbf{X}\} \ge 1 - \alpha. \tag{8.3.1}$$

In multiparameter cases, we can speak of Bayes credibility regions. Bayes tolerance
intervals are defined similarly.

Box and Tiao (1973) discuss Bayes intervals, called highest posterior density
(HPD) intervals. These intervals are defined as $\theta$ intervals for which the posterior
coverage probability is at least $(1 - \alpha)$ and every $\theta$-point within the interval has
a posterior density not smaller than that of any $\theta$-point outside the interval. More
generally, a region $R_\alpha(\mathbf{X})$ is called a $(1 - \alpha)$ HPD region if

(i) $P_H[\theta \in R_\alpha(\mathbf{X}) \mid \mathbf{X}] \ge 1 - \alpha$, for all $\mathbf{X}$; and
(ii) $h(\theta \mid \mathbf{x}) \ge h(\phi \mid \mathbf{x})$, for every $\theta \in R_\alpha(\mathbf{x})$ and $\phi \notin R_\alpha(\mathbf{x})$.

The HPD intervals in cases of unimodal posterior distributions provide, in nonsymmetric
cases, Bayes credibility intervals that are not equal tail ones. For various
interesting examples, see Box and Tiao (1973).
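
For a unimodal posterior density, an HPD interval can be approximated on a grid by collecting the points of highest density until their mass reaches $1 - \alpha$. The following sketch assumes, purely for illustration, a beta posterior; the function names, the grid size, and the parameter values are our own choices.

```python
import math


def beta_pdf(t, a, b):
    """Density of the beta(a, b) distribution at t."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


def hpd_interval(pdf, lo, hi, level=0.95, n_grid=20_001):
    """Grid approximation to an HPD interval of a unimodal density on [lo, hi]."""
    step = (hi - lo) / (n_grid - 1)
    pts = [lo + i * step for i in range(n_grid)]
    dens = [pdf(t) for t in pts]
    total = sum(dens)
    # Visit grid points from highest to lowest density, accumulating normalized mass
    order = sorted(range(n_grid), key=dens.__getitem__, reverse=True)
    mass, kept = 0.0, []
    for i in order:
        kept.append(i)
        mass += dens[i] / total
        if mass >= level:
            break
    return pts[min(kept)], pts[max(kept)]


# Hypothetical skewed posterior: the HPD interval is not an equal-tail interval
print(hpd_interval(lambda t: beta_pdf(t, 2.0, 8.0), 0.0, 1.0))
```

For a multimodal density the retained points need not form an interval, in which case this sketch would return only the convex hull of the HPD region.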


8.3.2 Prediction Intervals 


Suppose $X$ is a random variable (vector) having a p.d.f. $f(x; \theta)$, $\theta \in \Theta$. If $\theta$ is known,
an interval $I_\alpha(\theta)$ is called a prediction interval for $X$, at level $(1 - \alpha)$ if

$$P_\theta\{X \in I_\alpha(\theta)\} \ge 1 - \alpha. \tag{8.3.2}$$

When $\theta$ is unknown, one can use a Bayesian predictive distribution to determine an
interval $I_\alpha(H)$ such that the predictive probability of $\{X \in I_\alpha(H)\}$ is at least $1 - \alpha$.
This predictive interval depends on the prior distribution $H(\theta)$. After observing
$X_1, \ldots, X_n$, one can determine a prediction interval (region) for $(X_{n+1}, \ldots, X_{n+m})$ by
using the posterior distribution $H(\theta \mid \mathbf{X}_n)$ for the predictive distribution
$f_H(x \mid \mathbf{x}_n) = \int_{\Theta} f(x; \theta)\, dH(\theta \mid \mathbf{x}_n)$. In Example 8.12, we illustrate such prediction intervals. For
additional theory and examples, see Geisser (1993).


8.4 BAYESIAN ESTIMATION


8.4.1 General Discussion and Examples 


When the objective is to provide a point estimate of the parameter $\theta$ or a function
$\omega = g(\theta)$ we identify the action space with the parameter space. The decision function
$d(X)$ is an estimator with domain $\mathcal{X}$ and range $\Theta$, or $\Omega = g(\Theta)$. For various loss
functions the Bayes decision is an estimator $\hat{\theta}_H(X)$ that minimizes the posterior risk.
In the following table, we present some loss functions and the corresponding Bayes
estimators.

In the examples, we derive Bayesian estimators for several models of interest,
and show the dependence of the resulting estimators on the loss function and on the
prior distributions.


Loss Function                                          Bayes Estimator
-----------------------------------------------------  -----------------------------------------------------------
$(\hat{\theta} - \theta)^2$                            $\hat{\theta}(X) = E_H\{\theta \mid X\}$ (the posterior expectation)
$Q(\theta)(\hat{\theta} - \theta)^2$                   $E_H\{\theta Q(\theta) \mid X\} / E_H\{Q(\theta) \mid X\}$
$|\hat{\theta} - \theta|$                              $\hat{\theta}(X) =$ median of the posterior distribution, i.e., $H^{-1}(.5 \mid X)$
$a(\hat{\theta} - \theta)^- + b(\hat{\theta} - \theta)^+$   The $\frac{a}{a+b}$ quantile of $H(\theta \mid X)$, i.e., $H^{-1}(\frac{a}{a+b} \mid X)$

8.4.2 Hierarchical Models 


Lindley and Smith (1972) and Smith (1973a, b) advocated a somewhat more complicated
methodology. They argue that the choice of a proper prior should be based on the
notion of exchangeability. Random variables $W_1, W_2, \ldots, W_k$ are called exchangeable
if the joint distribution of $(W_1, \ldots, W_k)$ is the same as that of $(W_{i_1}, \ldots, W_{i_k})$,
where $(i_1, \ldots, i_k)$ is any permutation of $(1, 2, \ldots, k)$. The joint p.d.f. of exchangeable
random variables can be represented as a mixture of appropriate p.d.f.s of i.i.d.
random variables. More specifically, if, conditional on $w$, $W_1, \ldots, W_k$ are i.i.d. with
p.d.f. $f(W_1, \ldots, W_k; w) = \prod_{i=1}^{k} g(W_i; w)$, and if $w$ is given a probability distribution
$P(w)$ then the p.d.f.

$$f^*(W_1, \ldots, W_k) = \int \prod_{i=1}^{k} g(W_i; w)\, dP(w) \tag{8.4.1}$$

represents a distribution of exchangeable random variables. If the vector $\mathbf{X}$ represents
the means of $k$ independent samples the present model coincides with the Model II
of ANOVA, with known variance components and an unknown grand mean $\mu$. This
model is a special case of a Bayesian linear model called by Lindley and Smith a
three-stage linear model or hierarchical model. The general formulation of such a
model is

$$\mathbf{X} \sim N(A_1 \boldsymbol{\theta}_1, V),$$

$$\boldsymbol{\theta}_1 \sim N(A_2 \boldsymbol{\theta}_2, \Sigma),$$

and

$$\boldsymbol{\theta}_2 \sim N(A_3 \boldsymbol{\theta}_3, C),$$

where $\mathbf{X}$ is an $n \times 1$ vector, $\boldsymbol{\theta}_i$ are $p_i \times 1$ ($i = 1, 2, 3$), $A_1, A_2, A_3$ are known constant
matrices, and $V$, $\Sigma$, $C$ are known covariance matrices. Lindley and Smith (1972)
have shown that for a noninformative prior for $\boldsymbol{\theta}_2$ obtained by letting $C^{-1} \to 0$, the
Bayes estimator of $\boldsymbol{\theta}_1$, for the loss function $L(\hat{\boldsymbol{\theta}}_1, \boldsymbol{\theta}_1) = \|\hat{\boldsymbol{\theta}}_1 - \boldsymbol{\theta}_1\|^2$, is given by

$$\hat{\boldsymbol{\theta}}_1 = B^{-1} A_1' V^{-1} \mathbf{X}, \tag{8.4.2}$$

where

$$B = A_1' V^{-1} A_1 + \Sigma^{-1} - \Sigma^{-1} A_2 (A_2' \Sigma^{-1} A_2)^{-1} A_2' \Sigma^{-1}. \tag{8.4.3}$$

We see that this Bayes estimator coincides with the LSE, $(A_1'A_1)^{-1}A_1'\mathbf{X}$, when $V = I$
and $\Sigma^{-1} \to 0$. This result depends very strongly on the knowledge of the covariance
matrix $V$. Lindley and Smith (1972) suggested an iterative solution for a Bayesian
analysis when $V$ is unknown. Interesting special results for models of one-way and
two-way ANOVA can be found in Smith (1973b).
A comprehensive Bayesian analysis of the hierarchical Model II of ANOVA is
given in Chapter 5 of Box and Tiao (1973).


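In Gelman et al. (1995, pp. 129-134), we find an interesting example of a
hierarchical model in which

$$J_i \mid \theta_i \sim B(n_i, \theta_i), \quad i = 1, \ldots, k.$$

$\theta_1, \ldots, \theta_k$ are conditionally i.i.d., with

$$\theta_i \mid \alpha, \beta \sim \text{Beta}(\alpha, \beta), \quad i = 1, \ldots, k,$$

and $(\alpha, \beta)$ have an improper prior p.d.f.

$$g(\alpha, \beta) \propto (\alpha + \beta)^{-5/2}.$$

According to this model, $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$ is a vector of priorly exchangeable (not
independent) parameters. We can easily show that the posterior joint p.d.f. of $\boldsymbol{\theta}$, given
$\mathbf{J} = (J_1, \ldots, J_k)$ and $(\alpha, \beta)$ is

$$h(\boldsymbol{\theta} \mid \mathbf{J}, \alpha, \beta) = \prod_{j=1}^{k} \frac{1}{B(\alpha + J_j, \beta + n_j - J_j)}\, \theta_j^{\alpha + J_j - 1} (1 - \theta_j)^{\beta + n_j - J_j - 1}. \tag{8.4.4}$$

In addition, the posterior p.d.f. of $(\alpha, \beta)$ is

$$g(\alpha, \beta \mid \mathbf{J}) \propto g(\alpha, \beta) \prod_{j=1}^{k} \frac{B(\alpha + J_j, \beta + n_j - J_j)}{B(\alpha, \beta)}. \tag{8.4.5}$$

The objective is to obtain the joint posterior p.d.f.

$$\begin{aligned}
h(\boldsymbol{\theta} \mid \mathbf{J}) &= \int_0^\infty \int_0^\infty h(\boldsymbol{\theta} \mid \mathbf{J}, \alpha, \beta)\, g(\alpha, \beta \mid \mathbf{J})\, d\alpha\, d\beta \\
&\propto \int_0^\infty \int_0^\infty g(\alpha, \beta) \prod_{j=1}^{k} \frac{1}{B(\alpha, \beta)}\, \theta_j^{\alpha + J_j - 1} (1 - \theta_j)^{\beta + n_j - J_j - 1}\, d\alpha\, d\beta.
\end{aligned}$$

From $h(\boldsymbol{\theta} \mid \mathbf{J})$ one can derive a credibility region for $\boldsymbol{\theta}$, etc.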
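
The unnormalized posterior (8.4.5) of $(\alpha, \beta)$ is easy to evaluate on the log scale, which is the usual first step before integrating it numerically. The following is a minimal sketch; the function names and the data are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_post_alpha_beta(alpha, beta, successes, trials):
    """Unnormalized log of (8.4.5) with prior proportional to (alpha + beta)^(-5/2)."""
    value = -2.5 * math.log(alpha + beta)
    for j, n in zip(successes, trials):
        value += log_beta(alpha + j, beta + n - j) - log_beta(alpha, beta)
    return value


# Hypothetical data: J_i successes out of n_i trials in k = 4 groups
successes = [0, 2, 1, 5]
trials = [20, 19, 18, 24]
print(log_post_alpha_beta(1.5, 10.0, successes, trials))
```

Evaluating this function over a grid of $(\alpha, \beta)$ values gives a discrete approximation to $g(\alpha, \beta \mid \mathbf{J})$, from which the mixture $h(\boldsymbol{\theta} \mid \mathbf{J})$ can be approximated.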


8.4.3 The Normal Dynamic Linear Model 


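In time-series analysis for econometrics, signal processing in engineering and other
areas of applications, one often encounters series of random vectors that are related
according to the following linear dynamic model

$$\begin{aligned}
\mathbf{Y}_n &= A\boldsymbol{\theta}_n + \boldsymbol{\epsilon}_n, \\
\boldsymbol{\theta}_n &= G\boldsymbol{\theta}_{n-1} + \boldsymbol{\omega}_n, \quad n \ge 1,
\end{aligned} \tag{8.4.6}$$

where $A$ and $G$ are known matrices, which are (for simplicity) fixed. $\{\boldsymbol{\epsilon}_n\}$ is a sequence
of i.i.d. random vectors; $\{\boldsymbol{\omega}_n\}$ is a sequence of i.i.d. random vectors; $\{\boldsymbol{\epsilon}_n\}$ and $\{\boldsymbol{\omega}_n\}$
are independent sequences, and

$$\begin{aligned}
\boldsymbol{\epsilon}_n &\sim N(\mathbf{0}, V), \\
\boldsymbol{\omega}_n &\sim N(\mathbf{0}, \Omega).
\end{aligned} \tag{8.4.7}$$

We further assume that $\boldsymbol{\theta}_0$ has a prior normal distribution, i.e.,

$$\boldsymbol{\theta}_0 \sim N(\boldsymbol{\eta}_0, C_0), \tag{8.4.8}$$

and that $\boldsymbol{\theta}_0$ is independent of $\{\boldsymbol{\epsilon}_n\}$ and $\{\boldsymbol{\omega}_n\}$. This model is called the normal random
walk model.

We compute now the posterior distribution of $\boldsymbol{\theta}_1$, given $\mathbf{Y}_1$. From multivariate
normal theory, since

$$\mathbf{Y}_1 \mid \boldsymbol{\theta}_1 \sim N(A\boldsymbol{\theta}_1, V),$$

$$\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_0 \sim N(G\boldsymbol{\theta}_0, \Omega),$$

and

$$\boldsymbol{\theta}_0 \sim N(\boldsymbol{\eta}_0, C_0),$$

we obtain

$$\boldsymbol{\theta}_1 \sim N(G\boldsymbol{\eta}_0, \Omega + GC_0G').$$

Let $F_1 = \Omega + GC_0G'$. Then, we obtain after some manipulations

$$\boldsymbol{\theta}_1 \mid \mathbf{Y}_1 \sim N(\boldsymbol{\eta}_1, C_1),$$

where

$$\boldsymbol{\eta}_1 = G\boldsymbol{\eta}_0 + F_1A'[V + AF_1A']^{-1}(\mathbf{Y}_1 - AG\boldsymbol{\eta}_0), \tag{8.4.9}$$

and

$$C_1 = F_1 - F_1A'[V + AF_1A']^{-1}AF_1. \tag{8.4.10}$$

Define, recursively for $j \ge 1$,

$$F_j = \Omega + GC_{j-1}G',$$

and

$$\begin{aligned}
C_j &= F_j - F_jA'[V + AF_jA']^{-1}AF_j, \\
\boldsymbol{\eta}_j &= G\boldsymbol{\eta}_{j-1} + F_jA'[V + AF_jA']^{-1}(\mathbf{Y}_j - AG\boldsymbol{\eta}_{j-1}).
\end{aligned} \tag{8.4.11}$$

The recursive equations (8.4.11) are called the Kalman filter. Note that, for each
$n \ge 1$, $\boldsymbol{\eta}_n$ depends on $D_n = (\mathbf{Y}_1, \ldots, \mathbf{Y}_n)$. Moreover, we can prove by induction on
$n$, that

$$\boldsymbol{\theta}_n \mid D_n \sim N(\boldsymbol{\eta}_n, C_n), \tag{8.4.12}$$

for all $n \ge 1$. For additional theory and applications in Bayesian forecasting and
smoothing, see Harrison and Stevens (1976), West, Harrison, and Migon (1985), and
the book of West and Harrison (1997). We illustrate this sequential Bayesian process
in Example 8.19.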
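
In the scalar case the recursion (8.4.11) reduces to a few arithmetic operations per observation. The sketch below restricts attention to scalar $A$, $G$, $V$, $\Omega$ (an assumption made only to keep the example free of matrix libraries); the function name and the numerical values are hypothetical.

```python
def kalman_filter_scalar(ys, A, G, V, Omega, eta0, C0):
    """Scalar version of the recursion (8.4.11); returns [(eta_1, C_1), ..., (eta_n, C_n)]."""
    eta, C = eta0, C0
    out = []
    for y in ys:
        F = Omega + G * C * G                      # prior variance of theta_j given D_{j-1}
        gain = F * A / (V + A * F * A)             # F_j A' [V + A F_j A']^{-1}
        eta = G * eta + gain * (y - A * G * eta)   # posterior mean eta_j
        C = F - gain * A * F                       # posterior variance C_j
        out.append((eta, C))
    return out


# With G = 1 and Omega = 0 the state is a fixed unknown mean with prior N(0, 1)
print(kalman_filter_scalar([2.0, 2.0], A=1.0, G=1.0, V=1.0, Omega=0.0, eta0=0.0, C0=1.0))
```

In this degenerate case the filter reproduces the familiar normal-normal update: after two observations the posterior mean is the sum of the observations divided by 3, with posterior variance 1/3.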


8.5 APPROXIMATION METHODS 


In this section, we discuss two types of methods to approximate posterior distributions
and posterior expectations. The first type is analytical, which is usually effective in
large samples. The second type of approximation is numerical. The numerical
approximations are based either on numerical integration or on simulations. Approximations
are required when an exact functional form for the factor of proportionality in the
posterior density is not available. We have seen such examples earlier, like the posterior
p.d.f. (8.1.4).


8.5.1 Analytical Approximations 


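The analytic approximations are saddle-point approximations, based on variations of
the Laplace method, which is explained now.
Consider the problem of evaluating the integral

$$I = \int \cdots \int f(\boldsymbol{\theta}) \exp\{-nk(\boldsymbol{\theta})\}\, d\boldsymbol{\theta}, \tag{8.5.1}$$

where $\boldsymbol{\theta}$ is $m$-dimensional, and $k(\boldsymbol{\theta})$ has sufficiently high-order continuous partial
derivatives. Consider first the case of $m = 1$. Let $\hat{\theta}$ be an argument maximizing
$-k(\theta)$. Make a Taylor expansion of $k(\theta)$ around $\hat{\theta}$, i.e.,

$$k(\theta) = k(\hat{\theta}) + (\theta - \hat{\theta})k'(\hat{\theta}) + \frac{1}{2}(\theta - \hat{\theta})^2 k''(\hat{\theta}) + o((\theta - \hat{\theta})^2), \quad \text{as } \theta \to \hat{\theta}. \tag{8.5.2}$$

Here $k'(\hat{\theta}) = 0$ and $k''(\hat{\theta}) > 0$. Thus, substituting (8.5.2) in (8.5.1), the integral $I$ is
approximated by

$$\begin{aligned}
I &\cong \int f(\theta) \exp\Big\{-nk(\hat{\theta}) - \frac{n}{2} k''(\hat{\theta})(\theta - \hat{\theta})^2\Big\}\, d\theta \\
&= \exp\{-nk(\hat{\theta})\} \sqrt{\frac{2\pi}{nk''(\hat{\theta})}}\; E_N\{f(\theta)\},
\end{aligned} \tag{8.5.3}$$

where $E_N\{f(\theta)\}$ is the expected value of $f(\theta)$, with respect to the normal distribution
with mean $\hat{\theta}$ and variance $\hat{\sigma}_n^2 = \frac{1}{nk''(\hat{\theta})}$. The expectation $E_N\{f(\theta)\}$ can
sometimes be computed exactly, or one can apply the delta method to obtain the
approximation

$$E_N\{f(\theta)\} = f(\hat{\theta}) + \frac{1}{2} \frac{f''(\hat{\theta})}{nk''(\hat{\theta})} + O(n^{-3/2}). \tag{8.5.4}$$

Often we see the simpler approximation, in which $f(\hat{\theta})$ is used for $E_N\{f(\theta)\}$. In this
case, the approximation error is $O(n^{-1})$. If we use $f(\hat{\theta})$ for $E_N\{f(\theta)\}$, we obtain the
approximation

$$I \cong \exp\{-nk(\hat{\theta})\} f(\hat{\theta}) \sqrt{\frac{2\pi}{nk''(\hat{\theta})}}. \tag{8.5.5}$$

In the $m > 1$ case, the approximating formula becomes

$$I \cong \Big(\frac{2\pi}{n}\Big)^{m/2} |\Sigma(\hat{\boldsymbol{\theta}})|^{1/2} f(\hat{\boldsymbol{\theta}}) \exp\{-nk(\hat{\boldsymbol{\theta}})\}, \tag{8.5.6}$$

where

$$\Sigma^{-1}(\hat{\boldsymbol{\theta}}) = \Big( \frac{\partial^2}{\partial\theta_i \partial\theta_j} k(\boldsymbol{\theta});\; i, j = 1, \ldots, m \Big) \Big|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}. \tag{8.5.7}$$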
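
A classical check of the one-dimensional formula (8.5.5) is Stirling's approximation. Writing $n! = n^{n+1} \int_0^\infty \exp\{-n(\theta - \log\theta)\}\, d\theta$, we have $f \equiv 1$, $k(\theta) = \theta - \log\theta$, $\hat{\theta} = 1$ and $k''(\hat{\theta}) = 1$. The following sketch evaluates the approximation; the function names are our own.

```python
import math


def laplace_1d(f, k, k2, theta_hat, n):
    """First-order Laplace approximation (8.5.5) to the integral of f(t) exp{-n k(t)}.

    theta_hat must minimize k, and k2 is the second derivative of k.
    """
    return (math.exp(-n * k(theta_hat)) * f(theta_hat)
            * math.sqrt(2.0 * math.pi / (n * k2(theta_hat))))


def factorial_via_laplace(n):
    """Approximate n! through n! = n^(n+1) * integral of exp{-n (t - log t)} dt."""
    integral = laplace_1d(
        f=lambda t: 1.0,
        k=lambda t: t - math.log(t),
        k2=lambda t: 1.0 / t ** 2,
        theta_hat=1.0,
        n=n,
    )
    return float(n) ** (n + 1) * integral


for n in (5, 10, 20):
    print(n, factorial_via_laplace(n) / math.factorial(n))
```

The printed ratios approach 1 as n grows, in line with the $O(n^{-1})$ relative error stated for this form of the approximation.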


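These approximating formulae can be applied in Bayesian analysis, by letting $-nk(\boldsymbol{\theta})$
be the log-likelihood function, $l(\boldsymbol{\theta} \mid \mathbf{X}_n)$; $\hat{\boldsymbol{\theta}}$ be the MLE, $\hat{\boldsymbol{\theta}}_n$; and $\Sigma^{-1}(\hat{\boldsymbol{\theta}})$ be $J(\hat{\boldsymbol{\theta}}_n)$
given in (7.7.15). Accordingly, the posterior p.d.f., when the prior p.d.f. is $h(\boldsymbol{\theta})$, is
approximated by

$$h(\boldsymbol{\theta} \mid \mathbf{X}_n) \cong C_n(\mathbf{X}_n)\, h(\boldsymbol{\theta}) \exp\Big\{-\frac{n}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_n)' J(\hat{\boldsymbol{\theta}}_n)(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_n)\Big\}. \tag{8.5.8}$$

In this formula, $\hat{\boldsymbol{\theta}}_n$ is the MLE of $\boldsymbol{\theta}$ and

$$\begin{aligned}
C_n(\mathbf{X}_n) &= \Big[ \int \cdots \int h(\boldsymbol{\theta}) \exp\Big\{-\frac{n}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_n)' J(\hat{\boldsymbol{\theta}}_n)(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_n)\Big\}\, d\boldsymbol{\theta} \Big]^{-1} \\
&= \Big[ \Big(\frac{2\pi}{n}\Big)^{m/2} |J(\hat{\boldsymbol{\theta}}_n)|^{-1/2} E_N\{h(\boldsymbol{\theta})\} \Big]^{-1}.
\end{aligned} \tag{8.5.9}$$

If we approximate $E_N\{h(\boldsymbol{\theta})\}$ by $h(\hat{\boldsymbol{\theta}}_n)$, then the approximating formula reduces to

$$h(\boldsymbol{\theta} \mid \mathbf{X}_n) \cong \Big(\frac{n}{2\pi}\Big)^{m/2} |J(\hat{\boldsymbol{\theta}}_n)|^{1/2} \exp\Big\{-\frac{n}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_n)' J(\hat{\boldsymbol{\theta}}_n)(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_n)\Big\}. \tag{8.5.10}$$

This is a large sample normal approximation to the posterior density of $\boldsymbol{\theta}$. We can
write this, for large samples, as

$$\boldsymbol{\theta} \mid \mathbf{X}_n \approx N\Big(\hat{\boldsymbol{\theta}}_n, \frac{1}{n}(J(\hat{\boldsymbol{\theta}}_n))^{-1}\Big). \tag{8.5.11}$$

Note that Equation (8.5.11) does not depend on the prior distribution, and is not
expected therefore to yield a good approximation to $h(\boldsymbol{\theta} \mid \mathbf{X}_n)$ if the samples are not
very large.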

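One can improve upon the normal approximation (8.5.11) by combining the
likelihood function and the prior density $h(\boldsymbol{\theta})$ in the definition of $k(\boldsymbol{\theta})$. Thus, let

$$\tilde{k}(\boldsymbol{\theta}) = -\frac{1}{n}\big(l(\boldsymbol{\theta} \mid \mathbf{X}_n) + \log h(\boldsymbol{\theta})\big). \tag{8.5.12}$$

Let $\tilde{\boldsymbol{\theta}}$ be a value of $\boldsymbol{\theta}$ maximizing $-n\tilde{k}(\boldsymbol{\theta})$, or $\tilde{\boldsymbol{\theta}}_n$ the root of

$$\nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta} \mid \mathbf{X}_n) + \frac{1}{h(\boldsymbol{\theta})} \nabla_{\boldsymbol{\theta}}\, h(\boldsymbol{\theta}) = \mathbf{0}. \tag{8.5.13}$$

Let

$$\tilde{J}(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) - \frac{1}{n}\Big( \frac{\partial^2}{\partial\theta_i \partial\theta_j} \log h(\boldsymbol{\theta}) \Big). \tag{8.5.14}$$

Then, the saddle-point approximation to the posterior p.d.f. $h(\boldsymbol{\theta} \mid \mathbf{X}_n)$ is

$$h^*(\boldsymbol{\theta} \mid \mathbf{X}_n) = \Big(\frac{n}{2\pi}\Big)^{m/2} |\tilde{J}(\tilde{\boldsymbol{\theta}}_n)|^{1/2} \cdot \frac{h(\boldsymbol{\theta}) L(\boldsymbol{\theta} \mid \mathbf{X}_n)}{h(\tilde{\boldsymbol{\theta}}_n) L(\tilde{\boldsymbol{\theta}}_n \mid \mathbf{X}_n)}. \tag{8.5.15}$$

This formula is similar to the Barndorff-Nielsen $p^*$-formula (7.7.15) and reduces to the
$p^*$-formula if $h(\boldsymbol{\theta})d\boldsymbol{\theta} \propto d\boldsymbol{\theta}$. The normal approximation is given by (8.5.11), in which
$\hat{\boldsymbol{\theta}}_n$ is replaced by $\tilde{\boldsymbol{\theta}}_n$ and $J(\hat{\boldsymbol{\theta}}_n)$ is replaced by $\tilde{J}(\tilde{\boldsymbol{\theta}}_n)$.

For additional reading on analytic approximation for large samples, see Gamerman
(1997, Ch. 3), Reid (1995, pp. 351-368), and Tierney and Kadane (1986).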


8.5.2 Numerical Approximations 


In this section, we discuss two types of numerical approximations: numerical
integrations and simulations. The reader is referred to Evans and Swartz (2001).

I. Numerical Integrations
We have seen in the previous sections that, in order to evaluate posterior p.d.f., one
has to evaluate integrals of the form

$$I = \int_{-\infty}^{\infty} L(\theta \mid \mathbf{X}_n) h(\theta)\, d\theta. \tag{8.5.16}$$

Sometimes these integrals are quite complicated, like that of the RHS of Equation (8.1.4).

Suppose that, as in (8.5.16), the range of integration is from $-\infty$ to $\infty$ and
$I < \infty$. Consider first the case where $\theta$ is real. Making the one-to-one transformation
$\omega = e^\theta/(1 + e^\theta)$, the integral of (8.5.16) is reduced to

$$I = \int_0^1 q\Big(\log\frac{\omega}{1 - \omega}\Big) \frac{1}{\omega(1 - \omega)}\, d\omega, \tag{8.5.17}$$

where $q(\theta) = L(\theta \mid \mathbf{X}_n)h(\theta)$. There are many different methods of numerical
integration. A summary of various methods and their accuracy is given in Abramowitz
and Stegun (1968, p. 885). The reader is referred also to the book of Davis and
Rabinowitz (1984).

If we define $f(\omega)$ so that

$$f(\omega) = q\Big(\log\frac{\omega}{1 - \omega}\Big) \frac{1}{\omega^{3/2}(1 - \omega)^{1/2}}, \tag{8.5.18}$$

then, an $n$-point approximation to $I$ is given by

$$\hat{I}_n = \sum_{i=1}^{n} p_i f(\omega_i), \tag{8.5.19}$$

where

$$\omega_i = \cos^2\Big(\frac{\pi}{2} \cdot \frac{2i - 1}{2n + 1}\Big), \quad p_i = \frac{2\pi}{2n + 1}\, \omega_i, \quad i = 1, \ldots, n. \tag{8.5.20}$$

The error in this approximation is

$$R_n = \frac{\pi}{(2n)!\, 2^{4n+1}}\, f^{(2n)}(\xi), \quad 0 < \xi < 1. \tag{8.5.21}$$

An integral over $(-1, 1)$ can be split as

$$\int_{-1}^{1} q(u)\, du = \int_0^1 q(u)\, du + \int_0^1 q(-u)\, du. \tag{8.5.22}$$

Thus, (8.5.22) can be computed according to (8.5.19). Another method is to use an
$n$-point Gaussian quadrature formula:

$$I_n^* = \sum_{i=1}^{n} \omega_i q(u_i), \tag{8.5.23}$$

where $u_i$ and $\omega_i$ are tabulated in Table 25.4 of Abramowitz and Stegun (1968,
p. 916). Often it suffices to use $n = 8$ or $n = 12$ points in (8.5.23).
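
The $n$-point rule (8.5.19)-(8.5.20) takes only a few lines to program. In the sketch below the integrand $q$ is taken, as an assumption for illustration, to be the standard normal density, whose integral over the real line is 1; the function names are our own.

```python
import math


def n_point_rule(q, n):
    """n-point approximation (8.5.19)-(8.5.20) to the integral of q over the real line."""
    total = 0.0
    for i in range(1, n + 1):
        w = math.cos(0.5 * math.pi * (2 * i - 1) / (2 * n + 1)) ** 2  # node omega_i in (0, 1)
        p = 2.0 * math.pi / (2 * n + 1) * w                            # weight p_i
        theta = math.log(w / (1.0 - w))                                # back to the theta scale
        f = q(theta) / (w ** 1.5 * math.sqrt(1.0 - w))                 # f(omega_i) of (8.5.18)
        total += p * f
    return total


def std_normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


print(n_point_rule(std_normal_pdf, 10))
```

The weights sum to $\pi/2$ for every $n$, which is the exact value of $\int_0^1 \sqrt{\omega/(1 - \omega)}\, d\omega$ and provides a quick sanity check of an implementation.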


II. Simulation
The basic theorem applied in simulations to compute an integral $I = \int f(\theta)\, dH(\theta)$
is the strong law of large numbers (SLLN). We have seen in Chapter 1 that if
$X_1, X_2, \ldots$ is a sequence of i.i.d. random variables having a distribution $F(x)$, and
if $\int_{-\infty}^{\infty} |g(x)|\, dF(x) < \infty$ then

$$\frac{1}{n} \sum_{i=1}^{n} g(X_i) \xrightarrow[n \to \infty]{\text{a.s.}} \int_{-\infty}^{\infty} g(x)\, dF(x).$$

This important result is applied to approximate an integral $\int_{-\infty}^{\infty} f(\theta)\, dH(\theta)$ by a
sequence $\theta_1, \theta_2, \ldots$ of i.i.d. random variables, generated from the prior distribution
$H(\theta)$. Thus, for large $n$,

$$\int_{-\infty}^{\infty} f(\theta)\, dH(\theta) \cong \frac{1}{n} \sum_{i=1}^{n} f(\theta_i). \tag{8.5.24}$$

Computer programs are available in all statistical packages that simulate realizations
of a sequence of i.i.d. random variables, having specified distributions. These programs
use pseudorandom number generators, such as linear congruential generators, to generate
"pseudo" random numbers that have approximately uniform distribution on $(0, 1)$. For
discussion of these generators, see Bratley, Fox, and Schrage (1983).

Having generated i.i.d. uniform $R(0, 1)$ random variables $U_1, U_2, \ldots, U_n$, one
can obtain a simulation of i.i.d. random variables having a specific c.d.f. $F$, by the
transformation

$$X_i = F^{-1}(U_i), \quad i = 1, \ldots, n. \tag{8.5.25}$$

In some special cases, one can use different transformations. For example, if $U_1, U_2$
are independent $R(0, 1)$ random variables then the Box-Muller transformation

$$\begin{aligned}
X_1 &= (-2\log U_1)^{1/2} \cos(2\pi U_2), \\
X_2 &= (-2\log U_1)^{1/2} \sin(2\pi U_2),
\end{aligned} \tag{8.5.26}$$

yields two independent random variables having a standard normal distribution. It is
easier to simulate a $N(0, 1)$ random variable according to (8.5.26) than according to
$X = \Phi^{-1}(U)$. In today's technology, one could choose from a rich menu of simulation
procedures for many of the common distributions.
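
Both transformations are easy to try out. The sketch below implements the inverse transformation (8.5.25) for an exponential distribution (an example of our own choosing, since its c.d.f. inverts in closed form) and the Box-Muller transformation (8.5.26); the function names and the seed are hypothetical.

```python
import math
import random


def exponential_by_inversion(rng, rate):
    """Inverse transformation (8.5.25): F(x) = 1 - exp(-rate x), F^{-1}(u) = -log(1 - u) / rate."""
    return -math.log(1.0 - rng.random()) / rate


def box_muller(rng):
    """Box-Muller transformation (8.5.26): two independent N(0, 1) variables."""
    u1 = 1.0 - rng.random()   # lies in (0, 1], which avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)


rng = random.Random(7)
normals = [z for _ in range(10_000) for z in box_muller(rng)]
mean = sum(normals) / len(normals)
var = sum((z - mean) ** 2 for z in normals) / (len(normals) - 1)
print(mean, var)
```

The sample mean and variance of the generated values should be close to 0 and 1, respectively.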
If a prior distribution $H(\theta)$ is not in a simulation menu, or if $h(\theta)d\theta$ is not
proper, one can approximate $\int_{-\infty}^{\infty} f(\theta)h(\theta)\, d\theta$ by generating $\theta_1, \ldots, \theta_n$ from another
convenient distribution, $\lambda(\theta)d\theta$ say, and using the formula

$$\int_{-\infty}^{\infty} f(\theta)h(\theta)\, d\theta \cong \frac{1}{n} \sum_{i=1}^{n} f(\theta_i) \frac{h(\theta_i)}{\lambda(\theta_i)}. \tag{8.5.27}$$

The method of simulating from a substitute p.d.f. $\lambda(\theta)$ is called importance sampling,
and $\lambda(\theta)$ is called an importance density. The choice of $\lambda(\theta)$ should follow the
following guidelines:

(i) The support of $\lambda(\theta)$ should be the same as that of $h(\theta)$;
(ii) $\lambda(\theta)$ should be similar in shape, as much as possible, to $h(\theta)$; i.e., $\lambda(\theta)$ should
have the same means, standard deviations and other features, as those of $h(\theta)$.

The second guideline is sometimes complicated. For example, if $h(\theta)d\theta$
is the improper prior $d\theta$ and $I = \int_{-\infty}^{\infty} f(\theta)\, d\theta$, where $\int_{-\infty}^{\infty} |f(\theta)|\, d\theta < \infty$, one
could use first the monotone transformation $x = e^\theta/(1 + e^\theta)$ to reduce $I$ to
$I = \int_0^1 f\big(\log\frac{x}{1 - x}\big) \frac{dx}{x(1 - x)}$. One can use then a beta, $\beta(p, q)$, importance density
to simulate from, and approximate $I$ by

$$\hat{I} = \frac{1}{n} \sum_{i=1}^{n} f\Big(\log\frac{X_i}{1 - X_i}\Big) \frac{B(p, q)}{X_i^p (1 - X_i)^q}.$$

It would be simpler to use $\beta(1, 1)$, which is the uniform $R(0, 1)$.
An important question is, how large should the simulation sample be, so that the
approximation will be sufficiently precise. For large values of $n$, the approximation
$\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} f(\theta_i)h(\theta_i)/\lambda(\theta_i)$ is, by the Central Limit Theorem, approximately distributed like
$N\big(I, \frac{\tau^2}{n}\big)$, where

$$\tau^2 = V_S\{f(\theta)h(\theta)/\lambda(\theta)\}.$$

$V_S(\cdot)$ is the variance according to the simulation density. Thus, $n$ could be chosen
sufficiently large, so that $z_{1-\alpha/2} \cdot \frac{\tau}{\sqrt{n}} < \delta$. This will guarantee that with confidence
probability close to $(1 - \alpha)$ the true value of $I$ is within $\hat{I}_n \pm \delta$. The problem, however,
is that generally $\tau^2$ is not simple or is unknown. To overcome this problem one could
use a sequential sampling procedure, which attains asymptotically the fixed width
confidence interval. Such a procedure was discussed in Section 6.7.

We should remark in this connection that simulation results are less accurate
than those of numerical integration. One should use, as far as possible, numerical
integration rather than simulation.

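To illustrate this point, suppose that we wish to compute numerically

$$I = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left\{-\frac{x^2}{2}\right\}dx = 1.$$

Reduce $I$, as in (8.5.17), to

$$I = \frac{1}{\sqrt{2\pi}}\int_0^1 \exp\left\{-\frac{1}{2}\left(\log\frac{u}{1-u}\right)^2\right\}\frac{du}{u(1-u)}.$$

Simulation of $N = 10{,}000$ random variables $U_i \sim R(0, 1)$ yields the approximation

$$\hat I_{10,000} = \frac{1}{10{,}000\sqrt{2\pi}}\sum_{i=1}^{10,000}\exp\left\{-\frac{1}{2}\left(\log\frac{U_i}{1-U_i}\right)^2\right\}\frac{1}{U_i(1-U_i)} = 1.001595.$$

On the other hand, a 10-point numerical integration, according to (8.5.29), yields
$\hat I_{10} = 1$.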
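The simulation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `importance_estimate`, the seed, and the use of the standard `random` module are choices made here, not part of the text, and a different random stream will not reproduce the value 1.001595 quoted above.

```python
import math
import random

def importance_estimate(n, seed=0):
    """Estimate I = (1/sqrt(2*pi)) * integral of exp(-x^2/2) over the real line.

    The substitution x = log(u/(1-u)) maps the integral to (0, 1), and the
    uniform R(0, 1), i.e. beta(1, 1), importance density is used to draw u.
    Returns the estimate and its estimated standard error tau_hat/sqrt(n).
    """
    rng = random.Random(seed)          # hypothetical seed, for repeatability only
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        u = rng.random()
        while u == 0.0:                # guard against log(0); random() never returns 1.0
            u = rng.random()
        x = math.log(u / (1.0 - u))
        g = math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * u * (1.0 - u))
        total += g
        total_sq += g * g
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance / n)

estimate, std_err = importance_estimate(10_000)
print(estimate, std_err)
```

The second returned value is a plug-in estimate of $\tau/\sqrt{n}$, so it can be compared directly with the precision requirement $z_{1-\alpha/2}\,\tau/\sqrt{n} < \delta$.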


When $\theta$ is $m$-dimensional, $m \ge 2$, numerical integration might become too difficult.
In such cases, simulations might be the answer.

8.6 EMPIRICAL BAYES ESTIMATORS


Empirical Bayes estimators were introduced by Robbins (1956) for cases of repetitive
estimation under similar conditions, when Bayes estimators are desired but the
statistician does not wish to make specific assumptions about the prior distribution. The
following example illustrates this approach. Suppose that $X$ has a Poisson distribution
$P(\lambda)$, and $\lambda$ has some prior distribution $H(\lambda)$, $0 < \lambda < \infty$. The Bayes estimator of
$\lambda$ for the squared-error loss function is

$$E\{\lambda \mid X\} = \frac{\int_0^\infty \lambda\, p(X;\lambda)\,dH(\lambda)}{\int_0^\infty p(X;\lambda)\,dH(\lambda)},$$

where $p(x;\lambda)$ denotes the p.d.f. of $P(\lambda)$ at the point $x$. Since
$\lambda p(x;\lambda) = (x + 1)\,p(x + 1;\lambda)$ for every $\lambda$ and each $x = 0, 1, \ldots$ we can express
the above Bayes estimator in the form

$$E_H\{\lambda \mid X\} = (X + 1)\,\frac{\int_0^\infty p(X + 1;\lambda)\,dH(\lambda)}{\int_0^\infty p(X;\lambda)\,dH(\lambda)} = (X + 1)\,p_H(X + 1)/p_H(X), \tag{8.6.1}$$


where $p_H(x)$ is the predictive p.d.f. at $x$. Obviously, in order to determine
the posterior expectation we have to know the prior distribution $H(\lambda)$. On
the other hand, if the problem is repetitive in the sense that a sequence
$(X_1,\lambda_1), (X_2,\lambda_2), \ldots, (X_n,\lambda_n), \ldots$ is generated independently so that
$\lambda_1, \lambda_2, \ldots$ are i.i.d. having the same prior distribution $H(\lambda)$, and $X_1, \ldots, X_n$ are
conditionally independent, given $\lambda_1, \ldots, \lambda_n$, then we consider the sequence of observable
random variables $X_1, \ldots, X_n, \ldots$ as i.i.d. from the mixture of Poisson distributions with
p.d.f. $p_H(j)$, $j = 0, 1, 2, \ldots$. Thus, if on the $n$th epoch we observe $X_n = i_0$, we estimate,
on the basis of all the data, the value of $p_H(i_0 + 1)/p_H(i_0)$. A consistent estimator
of $p_H(j)$, for any $j = 0, 1, \ldots$, is $\frac{1}{n}\sum_{i=1}^{n} I\{X_i = j\}$, where $I\{X_i = j\}$ is the
indicator function of $\{X_i = j\}$. This follows from the SLLN. Thus, a consistent estimator of
the Bayes estimator $E_H\{\lambda \mid X_n\}$ is

$$\hat\lambda_n = (X_n + 1)\,\frac{\sum_{i=1}^{n} I\{X_i = X_n + 1\}}{\sum_{i=1}^{n} I\{X_i = X_n\}}. \tag{8.6.2}$$

This estimator is independent of the unknown $H(\lambda)$, and for large values of $n$ is
approximately equal to $E_H\{\lambda \mid X_n\}$. The estimator $\hat\lambda_n$ is called an empirical Bayes
estimator. The question is whether the prior risks, under the true $H(\lambda)$, of the
estimators $\hat\lambda_n$ converge, as $n \to \infty$, to the Bayes risk under $H(\lambda)$. A general discussion
of this issue with sufficient conditions for such convergence of the associated prior
risks is given in the paper of Robbins (1964).
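The estimator (8.6.2) needs nothing beyond counting, as the following minimal Python sketch shows. The function name `robbins_estimate` and the sample counts are hypothetical, introduced here only for illustration.

```python
def robbins_estimate(history):
    """Empirical Bayes estimate (8.6.2) of lambda at the latest epoch.

    history: observed counts X_1, ..., X_n; the last entry is X_n.
    """
    x_n = history[-1]
    same = sum(1 for x in history if x == x_n)          # always >= 1 (X_n itself)
    next_up = sum(1 for x in history if x == x_n + 1)   # occurrences of X_n + 1
    return (x_n + 1) * next_up / same

history = [0, 1, 1, 2, 0, 1]   # hypothetical counts; the last one is X_n = 1
print(robbins_estimate(history))
```

Note that the estimate is zero whenever the value $X_n + 1$ has not yet been observed, which is one way to see that for small $n$ the empirical Bayes estimator can be far from the true Bayes estimator.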

Many papers were written on the application of the empirical Bayes estimation 
method to repetitive estimation problems in which it is difficult or impossible to 
specify the prior distribution exactly. We have to remark in this connection that the 
empirical Bayes estimators are only asymptotically optimal. We have an adaptive 
decision process which corrects itself and approaches the optimal decisions only 
when n grows. How fast does it approach the optimal decisions? It depends on the 
amount of a priori knowledge of the true prior distribution. The initial estimators may 
be far from the true Bayes estimators. A few studies have been conducted to estimate 
the rate of approach of the prior risks associated with the empirical Bayes decisions to 
the true Bayes risk. Lin (1974) considered the one parameter exponential family and
the estimation of a function $\lambda(\theta)$ under squared-error loss. The true Bayes estimator is

$$\hat\lambda(x) = \int \lambda(\theta) f(x;\theta)\,dH(\theta) \Big/ \int f(x;\theta)\,dH(\theta),$$

and it is assumed that $\hat\lambda(x) f_H(x)$ can be expressed in the form
$\sum_{i=0}^{m} \omega_i(x) f_H^{(i)}(x)$, where $f_H^{(i)}(x)$ is the $i$th order derivative of $f_H(x)$ with
respect to $x$. The empirical Bayes estimators considered are based on consistent estimators
of the p.d.f. $f_H(x)$ and its derivatives. For the particular estimators suggested it is shown
that the rate of approach is of the order $O(n^{-\alpha})$ with $0 < \alpha \le 1/3$, where $n$ is the
number of observations.

In Example 8.26, we show that if the form of the prior is known, the rate of 
approach becomes considerably faster. When the form of the prior distribution is 
known the estimators are called semi-empirical Bayes, or parametric empirical 
Bayes. 

For further reading on the empirical Bayes method, see the book of Maritz (1970) 
and the papers of Casella (1985), Efron and Morris (1971, 1972a, 1972b), and Susarla 
(1982). 

The E-M algorithm discussed in Example 8.27 is a very important procedure for
estimation and overcoming problems of missing values. The book by McLachlan and
Krishnan (1997) provides the theory and many interesting examples.

Example 8.1. The experiment under consideration is to produce concrete under
certain conditions of mixing the ingredients, temperature of the air, humidity, etc.
Prior experience shows that concrete cubes manufactured in that manner will have a
compressive strength $X$ after 3 days of hardening, which has a log-normal distribution
$LN(\mu, \sigma^2)$. Furthermore, it is expected that 95% of such concrete cubes will have
compressive strength in the range of 216-264 (kg/cm$^2$).
According to our model, $Y = \log X \sim N(\mu, \sigma^2)$. Taking the (natural) logarithms
of the range limits, we expect most $Y$ values to be within the interval (5.375, 5.580).
The conditional distribution of $Y$ given $(\mu, \sigma^2)$ is

$$Y \mid \mu, \sigma^2 \sim N(\mu, \sigma^2).$$

Suppose that $\sigma^2$ is fixed at $\sigma^2 = 0.001$, and $\mu$ has a prior normal distribution
$\mu \sim N(\mu_0, \tau^2)$; then the predictive distribution of $Y$ is $N(\mu_0, \sigma^2 + \tau^2)$. Substituting
$\mu_0 = 5.475$, the predictive probability that $Y \in (5.375, 5.580)$, if $\sqrt{\sigma^2 + \tau^2} = 0.051$, is
0.95. Thus, we choose $\tau^2 = 0.0015$ for the prior distribution of $\mu$.

From this model of $Y \mid \mu, \sigma^2 \sim N(\mu, \sigma^2)$ and $\mu \sim N(\mu_0, \tau^2)$, the bivariate
distribution of $(Y, \mu)$ is

$$\begin{pmatrix} Y \\ \mu \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_0 \\ \mu_0 \end{pmatrix}, \begin{pmatrix} \sigma^2 + \tau^2 & \tau^2 \\ \tau^2 & \tau^2 \end{pmatrix}\right).$$

Hence, the conditional distribution of $\mu$ given $\{Y = y\}$ is, as shown in Section 2.9,

$$\mu \mid Y = y \sim N\left(\mu_0 + \frac{\tau^2}{\sigma^2 + \tau^2}(y - \mu_0),\ \tau^2\left(1 - \frac{\tau^2}{\sigma^2 + \tau^2}\right)\right) \sim N(2.19 + 0.6y,\ 0.0006).$$

The posterior distribution of $\mu$, given $\{Y = y\}$, is normal.
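As a quick numerical check of the posterior parameters in Example 8.1, the normal-normal update can be coded directly. The function name `normal_posterior` and the observed value $y = 5.5$ are hypothetical, chosen here only to exercise the formula.

```python
def normal_posterior(y, mu0, tau2, sigma2):
    """Posterior mean and variance of mu given Y = y, when
    Y | mu ~ N(mu, sigma2) and mu ~ N(mu0, tau2)."""
    weight = tau2 / (sigma2 + tau2)        # regression coefficient of mu on Y
    mean = mu0 + weight * (y - mu0)
    variance = tau2 * (1.0 - weight)
    return mean, variance

# Prior of Example 8.1; y = 5.5 is a hypothetical observation.
mean, variance = normal_posterior(y=5.5, mu0=5.475, tau2=0.0015, sigma2=0.001)
print(mean, variance)
```

With these inputs the weight is $0.0015/0.0025 = 0.6$, so the mean agrees with $2.19 + 0.6y$ and the variance with 0.0006, up to floating-point rounding.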
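Example 8.2. (a) $X_1, X_2, \ldots, X_n$ given $\lambda$ are conditionally i.i.d., having a Poisson
distribution $P(\lambda)$, i.e., $\mathcal{F} = \{P(\lambda), 0 < \lambda < \infty\}$.
Let $\mathcal{H} = \{G(\Lambda, \alpha), 0 < \alpha, \Lambda < \infty\}$, i.e., $\mathcal{H}$ is a family of prior gamma
distributions for $\lambda$. The minimal sufficient statistic, given $\lambda$, is $T_n = \sum_{i=1}^{n} X_i$, and
$T_n \mid \lambda \sim P(n\lambda)$. Thus, the posterior p.d.f. of $\lambda$, given $T_n$, is

$$h(\lambda \mid T_n) \propto \lambda^{T_n} e^{-n\lambda}\cdot\lambda^{\alpha - 1} e^{-\lambda\Lambda} = \lambda^{T_n + \alpha - 1} e^{-\lambda(n + \Lambda)}.$$

Hence, $\lambda \mid T_n \sim G(n + \Lambda, T_n + \alpha)$. The posterior distribution belongs to $\mathcal{H}$.
(b) $\mathcal{F} = \{G(\lambda, \alpha), 0 < \lambda < \infty\}$, $\alpha$ fixed. $\mathcal{H} = \{G(\Lambda, \nu), 0 < \nu, \Lambda < \infty\}$.

$$h(\lambda \mid X) \propto \lambda^{\alpha} e^{-\lambda X}\cdot\lambda^{\nu - 1} e^{-\lambda\Lambda} = \lambda^{\nu + \alpha - 1} e^{-\lambda(X + \Lambda)}.$$

Thus, $\lambda \mid X \sim G(X + \Lambda, \nu + \alpha)$.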
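Part (a) amounts to a two-number update, which the following sketch makes explicit. It assumes, as the kernels above indicate, that the first argument of $G(\cdot,\cdot)$ is the rate and the second the shape; the function name and the data are hypothetical.

```python
def gamma_poisson_update(rate, shape, counts):
    """Posterior (rate, shape) of lambda after observing i.i.d. Poisson counts,
    starting from a G(rate, shape) prior: G(Lambda, alpha) -> G(n + Lambda, T_n + alpha)."""
    n = len(counts)
    t_n = sum(counts)          # minimal sufficient statistic T_n
    return rate + n, shape + t_n

# Hypothetical prior G(2, 3) and four hypothetical counts.
post_rate, post_shape = gamma_poisson_update(rate=2.0, shape=3.0, counts=[1, 0, 4, 2])
print(post_rate, post_shape)
print(post_shape / post_rate)  # posterior mean of lambda under this parametrization
```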
Example 8.3. The following problem is often encountered in high technology
industry. 

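The number of soldering points on a typical printed circuit board (PCB) is often very
large. There is an automated soldering technology, called "wave soldering," which
involves a large number of different factors (conditions) represented by variables
$X_1, X_2, \ldots, X_k$. Let $J$ denote the number of faults in the soldering points on a
PCB. One can model $J$ as having a conditional Poisson distribution with mean $\lambda$,
which depends on the manufacturing conditions $X_1, \ldots, X_k$ according to a log-linear
relationship

$$\log\lambda = \beta_0 + \sum_{i=1}^{k}\beta_i x_i = \boldsymbol\beta'\mathbf{x},$$

where $\boldsymbol\beta' = (\beta_0, \ldots, \beta_k)$ and $\mathbf{x}' = (1, x_1, \ldots, x_k)$. $\boldsymbol\beta$ is generally an unknown
parametric vector. In order to estimate $\boldsymbol\beta$, one can design an experiment in which the
values of the control variables $X_1, \ldots, X_k$ are changed.

Let $J_i$ be the number of observed faulty soldering points on a PCB, under control
conditions given by $\mathbf{x}_i$ $(i = 1, \ldots, N)$. The likelihood function of $\boldsymbol\beta$, given $J_1, \ldots, J_N$
and $\mathbf{x}_1, \ldots, \mathbf{x}_N$, is

$$L(\boldsymbol\beta \mid J_1, \ldots, J_N, \mathbf{x}_1, \ldots, \mathbf{x}_N) = \prod_{i=1}^{N}\exp\{J_i\mathbf{x}_i'\boldsymbol\beta - e^{\mathbf{x}_i'\boldsymbol\beta}\} = \exp\left\{\mathbf{w}_N'\boldsymbol\beta - \sum_{i=1}^{N} e^{\mathbf{x}_i'\boldsymbol\beta}\right\},$$

where $\mathbf{w}_N = \sum_{i=1}^{N} J_i\mathbf{x}_i$. If we ascribe $\boldsymbol\beta$ a prior multinormal distribution, i.e.,
$\boldsymbol\beta \sim N(\boldsymbol\beta_0, V)$, then the posterior p.d.f. of $\boldsymbol\beta$, given
$D_N = (J_1, \ldots, J_N, \mathbf{x}_1, \ldots, \mathbf{x}_N)$, is

$$h(\boldsymbol\beta \mid D_N) \propto \exp\left\{\mathbf{w}_N'\boldsymbol\beta - \sum_{i=1}^{N} e^{\mathbf{x}_i'\boldsymbol\beta} - \frac{1}{2}(\boldsymbol\beta - \boldsymbol\beta_0)'V^{-1}(\boldsymbol\beta - \boldsymbol\beta_0)\right\}.$$

It is very difficult to express analytically the proportionality factor, even in special
cases, to make the RHS of $h(\boldsymbol\beta \mid D_N)$ a p.d.f.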
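Although the normalizing factor is intractable, the logarithm of the unnormalized posterior is easy to evaluate, which is what numerical integration or simulation methods require. The following is a minimal sketch; the function name, the argument layout (the prior enters through the matrix $V^{-1}$ given as a list of rows), and the toy data are assumptions made here for illustration.

```python
import math

def log_posterior_kernel(beta, counts, designs, prior_mean, prior_precision):
    """Log of the unnormalized posterior density of beta in Example 8.3:
    w_N' beta - sum_i exp(x_i' beta) - 0.5 (beta - beta0)' V^{-1} (beta - beta0).

    counts: fault counts J_i; designs: vectors x_i with the leading 1 included;
    prior_precision: the matrix V^{-1}, as a list of rows.
    """
    dim = len(beta)
    linear = [sum(x[j] * beta[j] for j in range(dim)) for x in designs]
    log_lik = sum(j_i * eta - math.exp(eta) for j_i, eta in zip(counts, linear))
    dev = [beta[j] - prior_mean[j] for j in range(dim)]
    quad = sum(dev[r] * prior_precision[r][c] * dev[c]
               for r in range(dim) for c in range(dim))
    return log_lik - 0.5 * quad

# Hypothetical toy data: two boards, one control variable.
value = log_posterior_kernel(
    beta=[0.0, 0.0],
    counts=[1, 2],
    designs=[[1.0, 0.0], [1.0, 1.0]],
    prior_mean=[0.0, 0.0],
    prior_precision=[[1.0, 0.0], [0.0, 1.0]],
)
print(value)
```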


Example 8.4. In this example, we derive the Jeffreys prior density for several
models.


A. $\mathcal{F} = \{b(x; n, \theta), 0 < \theta < 1\}$.
This is the family of binomial probability distributions. The Fisher information
function is

$$I(\theta) = \frac{n}{\theta(1 - \theta)}, \quad 0 < \theta < 1.$$

Thus, the Jeffreys prior for $\theta$ is

$$h(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}, \quad 0 < \theta < 1.$$

In this case, the prior density is

$$h(\theta) = \frac{1}{\pi}\,\theta^{-1/2}(1 - \theta)^{-1/2}, \quad 0 < \theta < 1.$$

This is a proper prior density. The posterior distribution of $\theta$, given $X$, under the
above prior is Beta$\left(X + \frac{1}{2}, n - X + \frac{1}{2}\right)$.
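B. $\mathcal{F} = \{N(\mu, \sigma^2); -\infty < \mu < \infty, 0 < \sigma < \infty\}$.
The Fisher information matrix is given in (3.8.8). The determinant of this matrix
is $|I(\mu, \sigma^2)| = 1/(2\sigma^6)$. Thus, the Jeffreys prior for this model is

$$h(\mu, \sigma^2) \propto \frac{1}{\sigma^3}.$$

Using this improper prior density the posterior p.d.f. of $(\mu, \sigma^2)$, given $X_1, \ldots, X_n$,
is

$$h(\mu, \sigma^2 \mid \bar X_n, Q) = \frac{\sqrt{n}\,Q^{n/2}}{\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{n}{2}\right)2^{(n+1)/2}}\,(\sigma^2)^{-(n+3)/2}\exp\left\{-\frac{1}{2\sigma^2}\left[n(\mu - \bar X_n)^2 + Q\right]\right\},$$

where $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $Q = \sum_{i=1}^{n}(X_i - \bar X_n)^2$. The parameter
$\phi = 1/\sigma^2$ is called the precision parameter. In terms of $\mu$ and $\phi$, the improper prior
density is

$$h(\mu, \phi) \propto \phi^{-1/2}.$$

The posterior density of $(\mu, \phi)$ correspondingly is

$$h(\mu, \phi \mid \bar X_n, Q) = \frac{\sqrt{n}\,Q^{n/2}}{\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{n}{2}\right)2^{(n+1)/2}}\,\phi^{(n-1)/2}\exp\left\{-\frac{\phi}{2}\left[n(\mu - \bar X_n)^2 + Q\right]\right\}.$$

Example 8.5. Consider a simple inventory system in which a certain commodity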
is stocked at the beginning of every day, according to a policy determined by the 
following considerations. The daily demand (in number of units) is a random variable
$X$ whose distribution belongs to a specified parametric family $\mathcal{F}$. Let $X_1, X_2, \ldots$
denote a sequence of i.i.d. random variables, whose common distribution $F(x;\theta)$
belongs to $\mathcal{F}$ and which represent the observed demand on consecutive days. The
stock level at the beginning of each day, $S_n$, $n = 1, 2, \ldots$, can be adjusted by increasing
or decreasing the available stock at the end of the previous day. We consider the
following inventory cost function

$$K(s, x) = c(s - x)^{+} + h(s - x)^{-},$$

where $c$, $0 < c < \infty$, is the daily cost of holding a unit in stock and $h$, $0 < h < \infty$,
is the cost (or penalty) for a shortage of one unit. Here $(s - x)^{+} = \max(0, s - x)$
and $(s - x)^{-} = -\min(0, s - x)$. If the distribution of $X$, $F(x;\theta)$, is known, then the
expected cost $R(S, \theta) = E_\theta\{K(S, X)\}$ is minimized by

$$S^0(\theta) = F^{-1}\left(\frac{h}{c + h};\theta\right),$$

where $F^{-1}(\gamma;\theta)$ is the $\gamma$-quantile of $F(x;\theta)$. If $\theta$ is unknown we cannot determine
$S^0(\theta)$. We show now a Bayesian approach to the determination of the stock levels.
Let $H(\theta)$ be a specific prior distribution of $\theta$. The prior expected daily cost is

$$\rho(S, H) = \int_\Theta R(S, \theta)h(\theta)\,d\theta,$$

or, since all the terms are nonnegative,

$$\rho(S, H) = \int_\Theta h(\theta)\left\{\sum_{x=0}^{\infty} f(x;\theta)K(S, x)\right\}d\theta = \sum_{x=0}^{\infty} K(S, x)\int_\Theta f(x;\theta)h(\theta)\,d\theta.$$

The value of $S$ which minimizes $\rho(S, H)$ is similar to (8.1.27),

$$S^0(H) = F_H^{-1}\left(\frac{h}{c + h}\right),$$

i.e., the $h/(c + h)$th-quantile of the predictive distribution $F_H(x)$.
After observing the value $x_1$ of $X_1$, we convert the prior distribution $H(\theta)$ to a
posterior distribution $H_1(\theta \mid x_1)$ and determine the predictive p.d.f. for the second
day, namely

$$f_{H_1}(y) = \int_\Theta f(y;\theta)h_1(\theta \mid x_1)\,d\theta.$$

The expected cost for the second day is

$$\rho(S_2, H) = \sum_{y=0}^{\infty} K(S_2, y) f_H(y).$$


Moreover, by the law of the iterated expectations

$$f_H(y) = \sum_{x=0}^{\infty} f_{H_1}(y \mid x) f_H(x).$$

Hence,

$$\rho(S_2, H) = \sum_{x=0}^{\infty} f_H(x)\sum_{y=0}^{\infty} K(S_2, y) f_{H_1}(y \mid x).$$

The conditional expectation $\sum_{y=0}^{\infty} K(S_2, y) f_{H_1}(y \mid x)$ is the posterior expected cost
given $X_1 = x$, or the predictive cost for the second day. The optimal choice of $S_2$ given
$X_1 = x$ is, therefore, the $h/(c + h)$-quantile of the predictive distribution $F_{H_1}(y \mid x)$,
i.e., $S_2^0(H) = F_{H_1}^{-1}\left(\frac{h}{c + h} \mid x\right)$. Since this function minimizes the predictive risk
for every $x$, it minimizes $\rho(S_2, H)$. In the same manner, we prove that after $n$ days,
given $\mathbf{X}_n = (x_1, \ldots, x_n)$, the optimal stock level for the beginning of the $(n + 1)$st
day is the $\frac{h}{c + h}$-quantile of the predictive distribution of $X_{n+1}$, given $\mathbf{X}_n = \mathbf{x}_n$, i.e.,
$S_{n+1}^0(\mathbf{x}_n) = F_{H_n}^{-1}\left(\frac{h}{c + h} \mid \mathbf{x}_n\right)$, where the predictive p.d.f. of $X_{n+1}$, given
$\mathbf{X}_n = \mathbf{x}$, is

$$f_{H_n}(y \mid \mathbf{x}) = \int_\Theta f(y;\theta)h(\theta \mid \mathbf{x})\,d\theta,$$

and $h(\theta \mid \mathbf{x})$ is the posterior p.d.f. of $\theta$ given $\mathbf{X}_n = \mathbf{x}$. The optimal stock levels are
determined sequentially for each day on the basis of the demand of the previous days.
Such a procedure is called an adaptive procedure. In particular, if $X_1, X_2, \ldots$ is a
sequence of i.i.d. Poisson random variables (r.v.s), $P(\theta)$, and if the prior distribution
$H(\theta)$ is the gamma distribution, $G\left(\frac{1}{\tau}, \nu\right)$, the posterior distribution of $\theta$ after $n$
observations is the gamma distribution $G\left(\frac{1}{\tau} + n, \nu + T_n\right)$, where $T_n = \sum_{i=1}^{n} X_i$. Let
$g\left(\theta \mid n + \frac{1}{\tau}, \nu + T_n\right)$ denote the p.d.f. of this posterior distribution. The predictive
distribution of $X_{n+1}$ given $\mathbf{X}_n$, which actually depends only on $T_n$, is

$$f_{H_n}(y \mid T_n) = \frac{1}{y!}\int_0^\infty e^{-\theta}\theta^{y} g\left(\theta \mid \frac{1}{\tau} + n, \nu + T_n\right)d\theta = \frac{\Gamma(\nu + y + T_n)}{\Gamma(y + 1)\Gamma(\nu + T_n)}\,\psi_n^{y}(1 - \psi_n)^{\nu + T_n},$$

where $\psi_n = \tau/(1 + (n + 1)\tau)$. This is the p.d.f. of the negative binomial
$NB(\psi_n, \nu + T_n)$. It is interesting that in the present case the predictive distribution belongs
to the family of the negative-binomial distributions for all $n = 1, 2, \ldots$. We can also
include the case of $n = 0$ by defining $T_0 = 0$. What changes from one day to another
are the parameters $(\psi_n, \nu + T_n)$. Thus, the optimal stock level at the beginning of the
$(n + 1)$st day is the $h/(c + h)$-quantile of the $NB(\psi_n, \nu + T_n)$.
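The adaptive rule of Example 8.5 can be sketched as follows: update $(\psi_n, \nu + T_n)$ from the demand history and take the $h/(c+h)$-quantile of the negative binomial by summing its p.d.f. The function names and all numerical inputs are hypothetical; the quantile is taken as the least integer whose c.d.f. reaches the target, which is an assumption about the convention intended.

```python
import math

def nb_pmf(y, psi, nu):
    """p.d.f. of NB(psi, nu): Gamma(nu+y) / (Gamma(y+1) Gamma(nu)) * psi^y * (1-psi)^nu."""
    log_p = (math.lgamma(nu + y) - math.lgamma(y + 1) - math.lgamma(nu)
             + y * math.log(psi) + nu * math.log(1.0 - psi))
    return math.exp(log_p)

def optimal_stock(c, h, tau, nu, demands):
    """h/(c+h)-quantile of the predictive NB(psi_n, nu + T_n) distribution,
    for a G(1/tau, nu) prior and the observed daily demands."""
    n, t_n = len(demands), sum(demands)
    psi = tau / (1.0 + (n + 1) * tau)
    target = h / (c + h)
    level = 0
    cdf = nb_pmf(0, psi, nu + t_n)
    while cdf < target:                 # accumulate the c.d.f. until it reaches the target
        level += 1
        cdf += nb_pmf(level, psi, nu + t_n)
    return level

# Hypothetical costs, prior parameters, and four days of demand.
print(optimal_stock(c=1.0, h=4.0, tau=2.0, nu=1.5, demands=[3, 5, 2, 4]))
```

Calling the function with an empty demand list gives the stock level for the first day, which corresponds to the convention $T_0 = 0$.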


Example 8.6. Consider the testing problem connected with the problem of detecting
disturbances in a manufacturing process. Suppose that the quality of a product is
represented by a random variable $X$ having a normal distribution $N(\theta, 1)$. When the
manufacturing process is under control the value of $\theta$ should be $\theta_0$. Every hour an
observation is taken on a product chosen at random from the process. Consider the
situation after $n$ hours. Let $X_1, \ldots, X_n$ be independent random variables representing
the $n$ observations. It is suspected that after $k$ hours of operation, $1 \le k < n$, a
malfunctioning occurred and the expected value $\theta$ shifted to a value $\theta_1$ greater than
$\theta_0$. The loss due to such a shift is $(\theta_1 - \theta_0)$ [\$] per hour. If a shift really occurred the
process should be stopped and rectified. On the other hand, if a shift has not occurred
and the process is stopped a loss of $K$ [\$] is charged. The prior probability that the
shift occurred is $\psi$. We present here the Bayes test of the two hypotheses

$$H_0: X_1, \ldots, X_n \text{ are i.i.d. like } N(\theta_0, 1)$$

against

$$H_1: X_1, \ldots, X_k \text{ are i.i.d. like } N(\theta_0, 1) \text{ and } X_{k+1}, \ldots, X_n \text{ are i.i.d. like } N(\theta_1, 1),$$

for a specified $k$, $1 \le k \le n - 1$, which is performed after the $n$th observation.
The likelihood functions under $H_0$ and under $H_1$ are, respectively, when $\mathbf{X}_n = \mathbf{x}_n$,


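$$L_0(\mathbf{x}_n) = \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta_0)^2\right\},$$

and

$$L_1(\mathbf{x}_n) = \exp\left\{-\frac{1}{2}\left[\sum_{i=1}^{k}(x_i - \theta_0)^2 + \sum_{i=k+1}^{n}(x_i - \theta_1)^2\right]\right\}.$$

Thus, the posterior probability that $H_0$ is true is

$$\pi(\mathbf{x}_n) = \pi L_0(\mathbf{x}_n)/\{\pi L_0(\mathbf{x}_n) + (1 - \pi)L_1(\mathbf{x}_n)\},$$

where $\pi = 1 - \psi$. The ratio of prior risks is in the present case
$K\pi/((1 - \pi)(n - k)(\theta_1 - \theta_0))$. The Bayes test implies that $H_0$ should be rejected if

$$\bar X^*_{n-k} \ge \frac{\theta_0 + \theta_1}{2} + \frac{1}{(n - k)(\theta_1 - \theta_0)}\log\frac{K\pi}{(1 - \pi)(n - k)(\theta_1 - \theta_0)},$$

where $\bar X^*_{n-k} = \frac{1}{n - k}\sum_{j=k+1}^{n} X_j$.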
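The Bayes (minimal prior) risk associated with this test is

$$\rho(\pi) = \pi K\epsilon_0(\pi) + (1 - \pi)(n - k)(\theta_1 - \theta_0)\epsilon_1(\pi),$$

where $\epsilon_0(\pi)$ and $\epsilon_1(\pi)$ are the error probabilities of rejecting $H_0$ or $H_1$ when they are
true. These error probabilities are given by

$$\epsilon_0(\pi) = P_{\theta_0}\left[\bar X^*_{n-k} \ge \frac{\theta_0 + \theta_1}{2} + A_{n-k}(\pi)\right] = 1 - \Phi\left(\sqrt{n - k}\left(\frac{\theta_1 - \theta_0}{2} + A_{n-k}(\pi)\right)\right),$$

where $\Phi(z)$ is the standard normal integral and

$$A_{n-k}(\pi) = \frac{1}{(n - k)(\theta_1 - \theta_0)}\left[\log\frac{K}{(n - k)(\theta_1 - \theta_0)} + \log\frac{\pi}{1 - \pi}\right].$$

Similarly,

$$\epsilon_1(\pi) = 1 - \Phi\left(\sqrt{n - k}\left(\frac{\theta_1 - \theta_0}{2} - A_{n-k}(\pi)\right)\right).$$

The function $A_{n-k}(\pi)$ is monotone increasing in $\pi$ and $\lim_{\pi\to 0} A_{n-k}(\pi) = -\infty$,
$\lim_{\pi\to 1} A_{n-k}(\pi) = \infty$. Accordingly, $\epsilon_0(0) = 1$, $\epsilon_1(0) = 0$ and $\epsilon_0(1) = 0$, $\epsilon_1(1) = 1$.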
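The test of Example 8.6 and its two error probabilities can be evaluated numerically. The sketch below assumes the rejection rule "mean of the last $n-k$ observations at least $(\theta_0+\theta_1)/2 + A_{n-k}(\pi)$"; the function names and all numerical inputs are hypothetical.

```python
import math

def std_normal_cdf(z):
    """Standard normal integral Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def shift_adjustment(n, k, theta0, theta1, pi, K):
    """A_{n-k}(pi): log of the ratio of prior risks, scaled by (n-k)(theta1-theta0)."""
    m, d = n - k, theta1 - theta0
    return math.log(K * pi / ((1.0 - pi) * m * d)) / (m * d)

def reject_h0(xs, k, theta0, theta1, pi, K):
    """Bayes test for a known shift point k; xs holds X_1, ..., X_n."""
    n = len(xs)
    mean_after = sum(xs[k:]) / (n - k)          # mean of X_{k+1}, ..., X_n
    threshold = 0.5 * (theta0 + theta1) + shift_adjustment(n, k, theta0, theta1, pi, K)
    return mean_after >= threshold

def error_probabilities(n, k, theta0, theta1, pi, K):
    """(epsilon_0, epsilon_1) of the Bayes test."""
    m, d = n - k, theta1 - theta0
    a = shift_adjustment(n, k, theta0, theta1, pi, K)
    eps0 = 1.0 - std_normal_cdf(math.sqrt(m) * (d / 2.0 + a))
    eps1 = 1.0 - std_normal_cdf(math.sqrt(m) * (d / 2.0 - a))
    return eps0, eps1

# Hypothetical setting: n = 10 hours, suspected shift after k = 6 hours.
print(error_probabilities(n=10, k=6, theta0=0.0, theta1=1.0, pi=0.5, K=4.0))
```

In this hypothetical setting the ratio of prior risks equals one, so $A_{n-k}(\pi) = 0$ and the two error probabilities coincide.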


Example 8.7. Consider the detection problem of Example 8.6 but now the point of
shift $k$ is unknown. If $\theta_0$ and $\theta_1$ are known then we have a problem of testing the
simple hypothesis $H_0$ (of Example 8.6) against the composite hypothesis

$$H_1: X_1, \ldots, X_k \sim N(\theta_0, 1);\ X_{k+1}, \ldots, X_n \sim N(\theta_1, 1), \quad \text{for } k = 1, \ldots, n - 1.$$

Let $\pi_0$ be the prior probability of $H_0$ and $\pi_j$, $j = 1, \ldots, n - 1$, the prior probabilities
under $H_1$ that $\{k = j\}$. The posterior probability of $H_0$ is then


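$$\Pi_0(\mathbf{X}_n) = \left(1 + \frac{1}{\pi_0}\sum_{j=1}^{n-1}\pi_j\exp\left\{-\frac{n}{2}\left[\frac{j}{n}(\bar X_j - \theta_0)^2 + \left(1 - \frac{j}{n}\right)(\bar X^*_{n-j} - \theta_1)^2 - (\bar X_n - \theta_0)^2\right]\right\}\right)^{-1},$$

where $\bar X_j = \frac{1}{j}\sum_{i=1}^{j} X_i$, $j = 1, \ldots, n$, and $\bar X^*_{n-j} = \frac{1}{n - j}\sum_{i=j+1}^{n} X_i$. The posterior
probability of $\{k = j\}$ is, for $j = 1, \ldots, n - 1$,

$$\Pi_j(\mathbf{X}_n) = \Pi_0(\mathbf{X}_n)\,\frac{\pi_j}{\pi_0}\exp\left\{-\frac{n}{2}\left[\frac{j}{n}(\bar X_j - \theta_0)^2 + \left(1 - \frac{j}{n}\right)(\bar X^*_{n-j} - \theta_1)^2 - (\bar X_n - \theta_0)^2\right]\right\}.$$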

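Let $R_i(\mathbf{X}_n)$ $(i = 0, 1)$ denote the posterior risk associated with accepting $H_i$. These
functions are given by

$$R_0(\mathbf{X}_n) = (\theta_1 - \theta_0)\sum_{j=1}^{n-1}(n - j)\Pi_j(\mathbf{X}_n),$$

and

$$R_1(\mathbf{X}_n) = K\Pi_0(\mathbf{X}_n).$$

$H_0$ is rejected if $R_1(\mathbf{X}_n) \le R_0(\mathbf{X}_n)$, or when

$$\sum_{j=1}^{n-1}(n - j)\pi_j\exp\left\{-\frac{n}{2}\left[\frac{j}{n}(\bar X_j - \theta_0)^2 + \left(1 - \frac{j}{n}\right)(\bar X^*_{n-j} - \theta_1)^2\right]\right\} \ge \frac{K\pi_0}{\theta_1 - \theta_0}\exp\left\{-\frac{n}{2}(\bar X_n - \theta_0)^2\right\}.$$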

Example 8.8. We consider here the problem of testing whether the mean of a
normal distribution is negative or positive. Let $X_1, \ldots, X_n$ be i.i.d. random variables
having a $N(\theta, 1)$ distribution. The null hypothesis is $H_0: \theta \le 0$ and the alternative
hypothesis is $H_1: \theta > 0$. We assign the unknown $\theta$ a prior normal distribution, i.e.,
$\theta \sim N(0, \tau^2)$. Thus, the prior probability of $H_0$ is $\pi = \frac{1}{2}$. The loss function $L_0(\theta)$ of
accepting $H_0$ and that of accepting $H_1$, $L_1(\theta)$, are of the form

$$L_0(\theta) = \begin{cases} 0, & \text{if } \theta \le 0, \\ \theta^2, & \text{if } \theta > 0, \end{cases} \qquad L_1(\theta) = \begin{cases} \theta^2, & \text{if } \theta < 0, \\ 0, & \text{if } \theta \ge 0. \end{cases}$$

For the determination of the posterior risk functions, we have to determine first the
posterior distribution of $\theta$ given $\mathbf{X}_n$. Since $\bar X_n$ is a minimal sufficient statistic the
conditional distribution of $\theta$ given $\bar X_n$ is the normal

$$N\left(\bar X_n\frac{n\tau^2}{1 + n\tau^2},\ \frac{\tau^2}{1 + n\tau^2}\right).$$

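(See Example 8.9 for elaboration.) It follows that the posterior risk associated with
accepting $H_0$ is

$$R_0(\bar X_n) = \frac{\sqrt{n}\left(1 + \frac{1}{n\tau^2}\right)^{1/2}}{\sqrt{2\pi}}\int_0^\infty \theta^2\exp\left\{-\frac{n}{2}\left(1 + \frac{1}{n\tau^2}\right)\left(\theta - \hat\theta(\bar X_n)\right)^2\right\}d\theta,$$

where $\hat\theta(\bar X_n)$ is the posterior mean. Generally, if $X \sim N(\xi, D^2)$ then

$$\frac{1}{\sqrt{2\pi}\,D}\int_0^\infty x^2\exp\left\{-\frac{1}{2D^2}(x - \xi)^2\right\}dx = (\xi^2 + D^2)\,\Phi\left(\frac{\xi}{D}\right) + \xi D\,\phi\left(\frac{\xi}{D}\right).$$

Substituting the expressions

$$\xi = \bar X_n\left(1 + \frac{1}{n\tau^2}\right)^{-1} \quad\text{and}\quad D^2 = \frac{1}{n\left(1 + \frac{1}{n\tau^2}\right)},$$

we obtain that

$$R_0(\bar X_n) = \left(1 + \frac{1}{n\tau^2}\right)^{-1}\left(\frac{1}{n} + \bar X_n^2\left(1 + \frac{1}{n\tau^2}\right)^{-1}\right)\Phi\left(\sqrt{n}\,\bar X_n\left(1 + \frac{1}{n\tau^2}\right)^{-1/2}\right) + \frac{\bar X_n}{\sqrt{n}}\left(1 + \frac{1}{n\tau^2}\right)^{-3/2}\phi\left(\sqrt{n}\,\bar X_n\left(1 + \frac{1}{n\tau^2}\right)^{-1/2}\right).$$

In a similar fashion, we prove that the posterior risk associated with accepting $H_1$ is

$$R_1(\bar X_n) = \left(1 + \frac{1}{n\tau^2}\right)^{-1}\left(\frac{1}{n} + \bar X_n^2\left(1 + \frac{1}{n\tau^2}\right)^{-1}\right)\Phi\left(-\sqrt{n}\,\bar X_n\left(1 + \frac{1}{n\tau^2}\right)^{-1/2}\right) - \frac{\bar X_n}{\sqrt{n}}\left(1 + \frac{1}{n\tau^2}\right)^{-3/2}\phi\left(\sqrt{n}\,\bar X_n\left(1 + \frac{1}{n\tau^2}\right)^{-1/2}\right).$$

The Bayes test procedure is to reject $H_0$ whenever $R_1(\bar X_n) \le R_0(\bar X_n)$. Thus, $H_0$
should be rejected whenever

$$\left(\frac{1}{n} + \bar X_n^2\left(1 + \frac{1}{n\tau^2}\right)^{-1}\right)\left[2\Phi\left(\sqrt{n}\,\bar X_n\left(1 + \frac{1}{n\tau^2}\right)^{-1/2}\right) - 1\right] \ge -\frac{2\bar X_n}{\sqrt{n}}\left(1 + \frac{1}{n\tau^2}\right)^{-1/2}\phi\left(\sqrt{n}\,\bar X_n\left(1 + \frac{1}{n\tau^2}\right)^{-1/2}\right).$$

But this holds if, and only if, $\bar X_n \ge 0$.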
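The conclusion that the Bayes test reduces to the sign of the sample mean can be checked numerically from the general identity for $X \sim N(\xi, D^2)$. The following sketch evaluates both posterior risks; the function names and the inputs $n = 10$, $\tau^2 = 2$ are hypothetical.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_risks(xbar, n, tau2):
    """Posterior risks (R0, R1) of accepting H0: theta <= 0 and H1: theta > 0
    under squared loss on the wrong side, with prior theta ~ N(0, tau2)."""
    shrink = 1.0 / (1.0 + 1.0 / (n * tau2))   # (1 + 1/(n tau^2))^{-1}
    xi = xbar * shrink                         # posterior mean
    d = math.sqrt(shrink / n)                  # posterior standard deviation
    z = xi / d
    second_moment = xi * xi + d * d
    r0 = second_moment * std_normal_cdf(z) + xi * d * std_normal_pdf(z)
    r1 = second_moment * std_normal_cdf(-z) - xi * d * std_normal_pdf(z)
    return r0, r1

for xbar in (-0.4, 0.0, 0.4):
    r0, r1 = posterior_risks(xbar, n=10, tau2=2.0)
    print(xbar, r0, r1, "reject H0" if r1 <= r0 else "accept H0")
```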


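Example 8.9. Let $X_1, X_2, \ldots$ be i.i.d. random variables having a normal distribution
with mean $\mu$, and variance $\sigma^2 = 1$. We wish to test $k = 3$ composite hypotheses
$H_{-1}: -\infty < \mu < -1$; $H_0: -1 \le \mu \le 1$; $H_1: \mu > 1$. Let $\mu$ have a prior normal
distribution, $\mu \sim N(0, \tau^2)$. Thus, let

$$\pi_{-1} = \Phi\left(-\frac{1}{\tau}\right), \qquad \pi_0 = \Phi\left(\frac{1}{\tau}\right) - \Phi\left(-\frac{1}{\tau}\right),$$

and $\pi_1 = 1 - \Phi\left(\frac{1}{\tau}\right)$ be the prior probabilities of $H_{-1}$, $H_0$, and $H_1$, respectively.
Furthermore, let

$$G_{-1}(\mu) = \begin{cases} \Phi(\mu/\tau)\big/\Phi\left(-\frac{1}{\tau}\right), & \text{if } \mu < -1, \\ 1, & \text{if } \mu \ge -1, \end{cases}$$

$$G_0(\mu) = \begin{cases} 0, & \text{if } \mu < -1, \\ \dfrac{\Phi\left(\frac{\mu}{\tau}\right) - \Phi\left(-\frac{1}{\tau}\right)}{\Phi\left(\frac{1}{\tau}\right) - \Phi\left(-\frac{1}{\tau}\right)}, & \text{if } -1 \le \mu \le 1, \\ 1, & \text{if } \mu > 1, \end{cases}$$

and

$$G_1(\mu) = \begin{cases} 0, & \text{if } \mu < 1, \\ \dfrac{\Phi\left(\frac{\mu}{\tau}\right) - \Phi\left(\frac{1}{\tau}\right)}{1 - \Phi\left(\frac{1}{\tau}\right)}, & \text{if } \mu \ge 1. \end{cases}$$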
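The predictive likelihood functions of the three hypotheses are then

$$L_{-1}(\bar X_n) = \frac{1}{\pi_{-1}}\cdot\frac{1}{\sqrt{1 + n\tau^2}}\,\Phi\left(-\frac{1 + n\tau^2 + n\tau^2\bar X_n}{\tau\sqrt{1 + n\tau^2}}\right)\exp\left\{-\frac{n}{2(1 + n\tau^2)}\bar X_n^2\right\},$$

$$L_0(\bar X_n) = \frac{1}{\pi_0}\left(\Phi\left(\frac{1 + n\tau^2 - n\tau^2\bar X_n}{\tau\sqrt{1 + n\tau^2}}\right) + \Phi\left(\frac{1 + n\tau^2 + n\tau^2\bar X_n}{\tau\sqrt{1 + n\tau^2}}\right) - 1\right)\frac{1}{\sqrt{1 + n\tau^2}}\exp\left\{-\frac{n}{2(1 + n\tau^2)}\bar X_n^2\right\},$$

and

$$L_1(\bar X_n) = \frac{1}{\pi_1}\cdot\frac{1}{\sqrt{1 + n\tau^2}}\,\Phi\left(-\frac{1 + n\tau^2 - n\tau^2\bar X_n}{\tau\sqrt{1 + n\tau^2}}\right)\exp\left\{-\frac{n}{2(1 + n\tau^2)}\bar X_n^2\right\}.$$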
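It follows that the posterior probabilities of $H_j$, $j = -1, 0, 1$, are as follows:

$$\pi_j(\bar X_n) = \begin{cases} 1 - \Phi\left(\dfrac{1 + n\tau^2(1 + \bar X_n)}{\tau\sqrt{1 + n\tau^2}}\right), & j = -1, \\[2ex] \Phi\left(\dfrac{1 + n\tau^2(1 - \bar X_n)}{\tau\sqrt{1 + n\tau^2}}\right) + \Phi\left(\dfrac{1 + n\tau^2(1 + \bar X_n)}{\tau\sqrt{1 + n\tau^2}}\right) - 1, & j = 0, \\[2ex] 1 - \Phi\left(\dfrac{1 + n\tau^2(1 - \bar X_n)}{\tau\sqrt{1 + n\tau^2}}\right), & j = 1. \end{cases}$$

Thus, $\pi_{-1}(\bar X_n) \ge 1 - \epsilon$ if

$$\bar X_n \le -\left(1 + \frac{1}{n\tau^2} + z_{1-\epsilon}\frac{\sqrt{1 + 1/(n\tau^2)}}{\sqrt{n}}\right).$$

Similarly, $\pi_1(\bar X_n) \ge 1 - \epsilon$ if

$$\bar X_n \ge 1 + \frac{1}{n\tau^2} + z_{1-\epsilon}\frac{\sqrt{1 + 1/(n\tau^2)}}{\sqrt{n}}.$$

Thus, if $b_n = 1 + \frac{1}{n\tau^2} + z_{1-\epsilon}\sqrt{1 + \frac{1}{n\tau^2}}\Big/\sqrt{n}$ then $b_n$ and $-b_n$ are outer stopping
boundaries. In the region $(-b_n, b_n)$, we have two inner boundaries $(-c_n, c_n)$ such
that if $|\bar X_n| < c_n$ then $H_0$ is accepted. The boundaries $\pm c_n$ can be obtained by solving
the equation

$$\Phi\left(\frac{1 + n\tau^2(1 - c_n)}{\tau\sqrt{1 + n\tau^2}}\right) + \Phi\left(\frac{1 + n\tau^2(1 + c_n)}{\tau\sqrt{1 + n\tau^2}}\right) = 2 - \epsilon.$$

$c_n \ge 0$ and $c_n > 0$ only if $n > n_0$, where

$$\Phi\left(\frac{\sqrt{1 + n_0\tau^2}}{\tau}\right) = 1 - \frac{\epsilon}{2}, \quad\text{or}\quad n_0 = z^2_{1-\epsilon/2} - \frac{1}{\tau^2}.$$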

Example 8.10. Consider the problem of estimating circular probabilities in the
normal case. In Example 6.4, we derived the uniformly most accurate (UMA) lower
confidence limit of the function

$$\psi(\sigma^2, \rho) = 1 - E_\rho\left\{P\left(J \,\Big|\, \frac{1}{2\sigma^2}\right)\right\},$$

where $J$ is a $NB\left(1 - \frac{1}{\rho}, \frac{1}{2}\right)$ random variable, for cases of known $\rho$. We derive here
the Bayes lower credibility limit of $\psi(\sigma^2, \rho)$ for cases of known $\rho$. The minimal
sufficient statistic is $T_{2n} = \sum_{i=1}^{n} X_i^2 + \frac{1}{\rho}\sum_{j=1}^{n} Y_j^2$. This statistic is distributed like
$\sigma^2\chi^2[2n]$ or like $G\left(\frac{1}{2\sigma^2}, n\right)$. Let $\omega = \frac{1}{2\sigma^2}$ and let $\omega \sim G(\tau, \nu)$. The posterior
distribution of $\omega$, given $T_{2n}$, is

$$\omega \mid T_{2n} \sim G(T_{2n} + \tau, n + \nu).$$

Accordingly, if $G^{-1}(p \mid T_{2n} + \tau, n + \nu)$ designates the $p$th quantile of this posterior
distribution,

$$P\{\omega \ge G^{-1}(\alpha \mid T_{2n} + \tau, n + \nu) \mid T_{2n}\} = 1 - \alpha,$$

with probability one (with respect to the mixed prior distribution of $T_{2n}$). Thus, we
obtain that a $1 - \alpha$ Bayes upper credibility limit for $\sigma^2$ is

$$\bar\sigma^2_\alpha = \frac{1}{2G^{-1}(\alpha \mid T_{2n} + \tau, n + \nu)} = \frac{T_{2n} + \tau}{2G^{-1}(\alpha \mid 1, n + \nu)}.$$

Note that if $\tau$ and $\nu$ are close to zero then the Bayes credibility limit is very close
to the non-Bayes UMA upper confidence limit derived in Example 7.4. Finally, the
$(1 - \alpha)$ Bayes lower credibility limit for $\psi(\sigma^2, \rho)$ is $\psi(\bar\sigma^2_\alpha, \rho)$.


Example 8.11. We consider in the present example the problem of inverse regression.
Suppose that the relationship between a controlled experimental variable $x$ and
an observed random variable $Y(x)$ is describable by a linear regression

$$Y(x) = \alpha + \beta x + \epsilon,$$

where $\epsilon$ is a random variable such that $E\{\epsilon\} = 0$ and $E\{\epsilon^2\} = \sigma^2$. The regression
coefficients $\alpha$ and $\beta$ are unknown. Given the results on $n$ observations at $x_1, \ldots, x_n$,
estimate the value of $\xi$ at which $E\{Y(\xi)\} = \eta$, where $\eta$ is a preassigned value. We
derive here Bayes confidence limits for $\xi = (\eta - \alpha)/\beta$, under the assumption that $m$
random variables are observed independently at $x_1$ and $x_2$, where $x_2 = x_1 + \Delta$. Both
$x_1$ and $\Delta$ are determined by the design. Furthermore, we assume that the distribution
of $\epsilon$ is $N(0, \sigma^2)$ and that $(\alpha, \beta)$ has a prior bivariate normal distribution with mean
$(\alpha_0, \beta_0)$ and covariance matrix $V = (v_{ij}; i, j = 1, 2)$. For the sake of simplicity, we
assume in the present example that $\sigma^2$ is known. The results can be easily extended
to the case of unknown $\sigma^2$.

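The minimal sufficient statistic is $(\bar Y_1, \bar Y_2)$ where $\bar Y_i$ is the mean of the $m$
observations at $x_i$ $(i = 1, 2)$. The posterior distribution of $(\alpha, \beta)$ given $(\bar Y_1, \bar Y_2)$ is the
bivariate normal with mean vector

$$\begin{pmatrix} \alpha_1 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix} + VX'\left(\frac{\sigma^2}{m}I + XVX'\right)^{-1}\begin{pmatrix} \bar Y_1 - (\alpha_0 + \beta_0 x_1) \\ \bar Y_2 - (\alpha_0 + \beta_0 x_2) \end{pmatrix},$$

where

$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix}$$

and $I$ is the $2 \times 2$ identity matrix. Note that $X$ is nonsingular. $X$ is called the design
matrix. The covariance matrix of the posterior distribution is

$$\Sigma = V - VX'\left(\frac{\sigma^2}{m}I + XVX'\right)^{-1}XV.$$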

Let us denote the elements of $\Sigma$ by $\Sigma_{ij}$, $i, j = 1, 2$. The problem is to determine
the Bayes credibility interval to the parameter $\xi = (\eta - \alpha)/\beta$. Let $\underline\xi_\alpha$ and $\bar\xi_\alpha$ denote
the limits of such a $(1 - \alpha)$ Bayes credibility interval. These limits should satisfy the
posterior confidence level requirement


If we consider equal tail probabilities, these confidence limits are obtained by solving
simultaneously the equations


where $\underline D = \Sigma_{11} + 2\underline\xi_\alpha\Sigma_{12} + \underline\xi_\alpha^2\Sigma_{22}$ and similarly
$\bar D = \Sigma_{11} + 2\bar\xi_\alpha\Sigma_{12} + \bar\xi_\alpha^2\Sigma_{22}$. By
inverting, we can realize that the credibility limits $\underline\xi_\alpha$ and $\bar\xi_\alpha$ are the two roots of the
quadratic equation

$$(\eta - \alpha_1 - \beta_1\xi)^2 = \chi^2_{1-\alpha}[1]\,(\Sigma_{11} + 2\xi\Sigma_{12} + \xi^2\Sigma_{22}),$$

or

$$A\xi^2 - 2B\xi + C = 0,$$

where

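$$A = \beta_1^2 - \chi^2_{1-\alpha}[1]\,\Sigma_{22},$$
$$B = \beta_1(\eta - \alpha_1) + \chi^2_{1-\alpha}[1]\,\Sigma_{12},$$
$$C = (\eta - \alpha_1)^2 - \chi^2_{1-\alpha}[1]\,\Sigma_{11}.$$

The two roots (if they exist) are

$$\xi_{1,2} = \frac{\beta_1(\eta - \alpha_1) + \chi^2_{1-\alpha}[1]\,\Sigma_{12} \pm \left(\chi^2_{1-\alpha}[1]\right)^{1/2}\left\{(\eta - \alpha_1)^2\Sigma_{22} + 2(\eta - \alpha_1)\beta_1\Sigma_{12} + \beta_1^2\Sigma_{11} - \chi^2_{1-\alpha}[1]\,|\Sigma|\right\}^{1/2}}{\beta_1^2 - \chi^2_{1-\alpha}[1]\,\Sigma_{22}},$$

where $|\Sigma|$ denotes the determinant of the posterior covariance matrix. These
credibility limits exist if the discriminant

$$\Delta^* = (\eta - \alpha_1)^2\Sigma_{22} + 2(\eta - \alpha_1)\beta_1\Sigma_{12} + \beta_1^2\Sigma_{11} - \chi^2_{1-\alpha}[1]\,|\Sigma|$$

is nonnegative. After some algebraic manipulations, we obtain that

$$|\Sigma| = \left(\frac{\sigma^2}{m}\right)^2 |V|\cdot\left|\left(\frac{\sigma^2}{m} + \operatorname{tr}\{XVX'\}\right)I - XVX'\right|^{-1},$$

where $\operatorname{tr}\{\cdot\}$ is the trace of the matrix in $\{\ \}$. Thus, if $m$ is sufficiently large, $\Delta^* > 0$
and the two credibility limits exist with probability one.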


Example 8.12. Suppose that $X_1, \ldots, X_{n+1}$ are i.i.d. random variables, having a
Poisson distribution $P(\lambda)$, $0 < \lambda < \infty$. We ascribe $\lambda$ a prior gamma distribution, i.e.,
$\lambda \sim G(\Lambda, \alpha)$.

After observing $X_1, \ldots, X_n$, the posterior distribution of $\lambda$, given $T_n = \sum_{i=1}^{n} X_i$,
is $\lambda \mid T_n \sim G(\Lambda + n, T_n + \alpha)$. The predictive distribution of $X_{n+1}$, given $T_n$, is the
negative-binomial, i.e.,

$$X_{n+1} \mid T_n \sim NB(\psi_n, T_n + \alpha),$$

where

$$\psi_n = \frac{1}{\Lambda + n + 1}.$$


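Let $NB^{-1}(p; \psi, \alpha)$ denote the $p$th quantile of $NB(\psi, \alpha)$. The prediction interval for
$X_{n+1}$, after observing $X_1, \ldots, X_n$, at level $1 - \alpha$, is

$$\left(NB^{-1}\left(\frac{\alpha}{2}; \psi_n, T_n + \alpha\right),\ NB^{-1}\left(1 - \frac{\alpha}{2}; \psi_n, T_n + \alpha\right)\right).$$

According to Equation (2.3.12), the $p$th quantile of $NB(\psi, \alpha)$ is $NB^{-1}(p \mid \psi, \alpha) =$
least integer $k$, $k \ge 1$, such that $I_{1-\psi}(\alpha, k + 1) \ge p$.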
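A prediction interval of this kind can be computed by accumulating the negative-binomial p.d.f. instead of evaluating the incomplete beta function. The sketch below rests on two assumptions that should be read as such: it uses $\psi_n = 1/(\Lambda + n + 1)$, the value implied by the Poisson-gamma mixture with $\Lambda$ as the rate (as in Example 8.5 with $\Lambda = 1/\tau$), and it starts the quantile search at 0, the smallest value in the support. To avoid the double use of $\alpha$, the code calls the prior parameters `rate` and `shape` and the interval level `level`; all names and inputs are hypothetical.

```python
import math

def nb_pmf(y, psi, nu):
    """p.d.f. of NB(psi, nu): Gamma(nu+y) / (Gamma(y+1) Gamma(nu)) * psi^y * (1-psi)^nu."""
    log_p = (math.lgamma(nu + y) - math.lgamma(y + 1) - math.lgamma(nu)
             + y * math.log(psi) + nu * math.log(1.0 - psi))
    return math.exp(log_p)

def nb_quantile(p, psi, nu):
    """Least integer k (searching from 0) whose NB(psi, nu) c.d.f. is at least p."""
    k = 0
    cdf = nb_pmf(0, psi, nu)
    while cdf < p:
        k += 1
        cdf += nb_pmf(k, psi, nu)
    return k

def prediction_interval(counts, rate, shape, level):
    """Equal-tail prediction interval for X_{n+1} under a G(rate, shape) prior."""
    n, t_n = len(counts), sum(counts)
    psi = 1.0 / (rate + n + 1.0)       # assumed form of psi_n, see the lead-in
    nu = shape + t_n
    tail = (1.0 - level) / 2.0
    return nb_quantile(tail, psi, nu), nb_quantile(1.0 - tail, psi, nu)

# Hypothetical data and prior.
print(prediction_interval(counts=[2, 0, 3, 1, 4], rate=1.0, shape=2.0, level=0.95))
```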
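Example 8.13. Suppose that an $n$-dimensional random vector $\mathbf{X}_n$ has the multinormal
distribution $\mathbf{X}_n \mid \mu \sim N(\mu\mathbf{1}_n, V_n)$, where $-\infty < \mu < \infty$ is unknown. The
covariance matrix $V_n$ is known. Assume that $\mu$ has a prior normal distribution,
$\mu \sim N(\mu_0, \tau^2)$. The posterior distribution of $\mu$, given $\mathbf{X}_n$, is $\mu \mid \mathbf{X}_n \sim N(\eta(\mathbf{X}_n), D_n)$,
where

$$\eta(\mathbf{X}_n) = \mu_0 + \tau^2\mathbf{1}_n'(V_n + \tau^2 J_n)^{-1}(\mathbf{X}_n - \mu_0\mathbf{1}_n)$$

and

$$D_n = \tau^2\left(1 - \tau^2\mathbf{1}_n'(V_n + \tau^2 J_n)^{-1}\mathbf{1}_n\right).$$

Accordingly, the predictive distribution of a yet unobserved $m$-dimensional vector
$\mathbf{Y}_m \mid \mu \sim N(\mu\mathbf{1}_m, V_m)$ is

$$\mathbf{Y}_m \mid \mathbf{X}_n \sim N(\eta(\mathbf{X}_n)\mathbf{1}_m, V_m + D_n J_m).$$

Thus, a prediction region for $\mathbf{Y}_m$, at level $(1 - \alpha)$, is the ellipsoid of concentration

$$(\mathbf{Y}_m - \eta(\mathbf{X}_n)\mathbf{1}_m)'(V_m + D_n J_m)^{-1}(\mathbf{Y}_m - \eta(\mathbf{X}_n)\mathbf{1}_m) \le \chi^2_{1-\alpha}[m].$$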

Example 8.14. A new drug is introduced and the physician wishes to determine
a lower prediction limit with confidence probability of $\gamma = 0.95$ for the number of
patients in a group of $n = 10$ that will be cured. If $X_n$ is the number of patients cured
among $n$ and if $\theta$ is the individual probability to be cured the model is binomial,
i.e., $X_n \sim B(n, \theta)$. The lower prediction limit, for a given value of $\theta$, is an integer
$k_\gamma(\theta)$ such that $P_\theta\{X_n \ge k_\gamma(\theta)\} \ge \gamma$. If $B^{-1}(p; n, \theta)$ denotes the $p$th quantile of the
binomial $B(n, \theta)$ then $k_\gamma(\theta) = \max(0, B^{-1}(1 - \gamma; n, \theta) - 1)$. Since the value of $\theta$ is
unknown, we cannot determine $k_\gamma(\theta)$. Lower tolerance limits, which were discussed
in Section 6.5, could provide estimates to the unknown $k_\gamma(\theta)$. A statistician may
feel, however, that lower tolerance limits are too conservative, since he has good a
priori information about $\theta$. Suppose a statistician believes that $\theta$ is approximately
equal to 0.8, and therefore, assigns $\theta$ a prior beta distribution $\beta(p, q)$ with mean
0.8 and variance 0.01. Setting the equations for the mean and variance of a $\beta(p, q)$
distribution (see Table 2.1 of Chapter 2), and solving for $p$ and $q$, we obtain $p = 12$
and $q = 3$. We consider now the predictive distribution of $X_n$ under the $\beta(12, 3)$ prior
distribution of $\theta$. This predictive distribution has a probability function


$$p_H(j) = \binom{n}{j}\frac{B(12 + j, 3 + n - j)}{B(12, 3)}, \quad j = 0, \ldots, n.$$

For $n = 10$, we obtain the following predictive p.d.f. $p_H(j)$ and c.d.f. $F_H(j)$.
According to this predictive distribution, the probability of at least 5 cures out of 10 patients
is 0.972 and for at least 6 cures is 0.925.


 j    p_H(j)      F_H(j)
 0    0.000034    0.000034
 1    0.000337    0.000370
 2    0.001790    0.002160
 3    0.006681    0.008841
 4    0.019488    0.028329
 5    0.046770    0.075099
 6    0.094654    0.169752
 7    0.162263    0.332016
 8    0.231225    0.563241
 9    0.256917    0.820158
10    0.179842    1.000000
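The predictive table can be reproduced exactly with rational arithmetic, since for integer arguments $B(a, b) = (a-1)!\,(b-1)!/(a+b-1)!$. The sketch below also reads off the largest $k$ whose predictive probability $P\{X_n \ge k\}$ is at least $\gamma$; the function names are hypothetical, and treating that $k$ as the Bayes lower prediction limit is the interpretation assumed here.

```python
from fractions import Fraction
from math import comb, factorial

def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a and b, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def predictive_pmf(n, p, q):
    """Predictive probabilities p_H(j), j = 0, ..., n, under a beta(p, q) prior."""
    return [comb(n, j) * beta_fn(p + j, q + n - j) / beta_fn(p, q)
            for j in range(n + 1)]

def lower_prediction_limit(pmf, gamma):
    """Largest k such that the predictive probability of {X >= k} is at least gamma."""
    tail = Fraction(1)                  # P{X >= 0}
    k = 0
    for j, prob in enumerate(pmf):
        if tail >= gamma:               # tail currently holds P{X >= j}
            k = j
        tail -= prob
    return k

pmf = predictive_pmf(10, 12, 3)
cdf = Fraction(0)
for j, prob in enumerate(pmf):
    cdf += prob
    print(j, round(float(prob), 6), round(float(cdf), 6))
print(lower_prediction_limit(pmf, Fraction(95, 100)))
```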


Example 8.15. Suppose that in a given (rather simple) inventory system (see
Example 8.2) the monthly demand, $X$, of some commodity is a random variable having a
Poisson distribution $P(\theta)$, $0 < \theta < \infty$. We wish to derive a Bayes estimator of the
expected demand $\theta$. In many of the studies on Bayes estimators of $\theta$, a prior gamma
distribution $G\left(\frac{1}{\tau}, \nu\right)$ is assumed for $\theta$. The prior parameters $\tau$ and $\nu$, $0 < \tau, \nu < \infty$,
are specified. Note that the prior expectation of $\theta$ is $\nu\tau$ and its prior variance is $\nu\tau^2$.
A large prior variance is generally chosen if the prior information on $\theta$ is vague.
This yields a flat prior distribution. On the other hand, if the prior information on
$\theta$ is strong in the sense that we have a high prior confidence that $\theta$ lies close to a
value $\theta_0$ say, pick $\nu\tau = \theta_0$ and $\nu\tau^2$ very small, by choosing $\tau$ to be small. In any
case, the posterior distribution of $\theta$, given a sample of $n$ i.i.d. random variables
$X_1, \ldots, X_n$, is determined in the following manner. $T_n = \sum_{i=1}^{n} X_i$ is a minimal
sufficient statistic, where $T_n \sim P(n\theta)$. The derivation of the posterior density can be
based on the p.d.f. of $T_n$. Thus, the product of the p.d.f. of $T_n$ by the prior p.d.f. of
$\theta$ is proportional to $\theta^{t+\nu-1}e^{-\theta(n+1/\tau)}$, where $T_n = t$. The factors that were omitted
from the product of the p.d.f.s are independent of $\theta$ and are, therefore, irrelevant.
We recognize in the function $\theta^{t+\nu-1}e^{-\theta(n+1/\tau)}$ the kernel (the factor depending on
$\theta$) of a gamma p.d.f. Accordingly, the posterior distribution of $\theta$, given $T_n$, is the
gamma distribution $G\left(n + \frac{1}{\tau}, T_n + \nu\right)$. If we choose a squared-error loss function,
then the posterior expectation is the Bayes estimator. We thus obtain the estimator
$\hat\theta = (T_n + \nu)\big/\left(n + \frac{1}{\tau}\right)$. Note that the unbiased and the MLE of $\theta$ is $T_n/n$ which is
not useful as long as $T_n = 0$, since we know that $\theta > 0$. If certain commodities have a
very slow demand (a frequently encountered phenomenon among replacement parts)
then $T_n$ may be zero even when $n$ is moderately large. On the other hand, the Bayes
estimator $\hat\theta$ is always positive.
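The contrast between the MLE and the Bayes estimator for a slow-moving item is easy to see numerically. In this minimal sketch the function names, the prior values, and the twelve months of zero demand are hypothetical.

```python
def bayes_estimate(total_demand, n, tau, nu):
    """Posterior mean (T_n + nu) / (n + 1/tau) of theta under a G(1/tau, nu) prior."""
    return (total_demand + nu) / (n + 1.0 / tau)

def mle(total_demand, n):
    """Maximum likelihood (and unbiased) estimate T_n / n."""
    return total_demand / n

# Hypothetical slow-moving part: no demand in 12 months; prior mean nu*tau = 0.5.
print(mle(0, 12), bayes_estimate(0, 12, tau=0.25, nu=2.0))
```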


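Example 8.16. (a) Let $X_1, \ldots, X_n$ be i.i.d. random variables having a normal
distribution $N(\theta, 1)$, $-\infty < \theta < \infty$. The minimal sufficient statistic is the sample
mean $\bar X$. We assume that $\theta$ has a prior normal distribution $N(0, \tau^2)$. We derive the
Bayes estimator for the zero-one loss function,

$$L(\hat\theta, \theta) = I\{\theta; |\hat\theta - \theta| \ge \delta\}.$$

The posterior distribution of $\theta$ given $\bar X$ is normal $N\left(\bar X(1 + 1/n\tau^2)^{-1}, (n + 1/\tau^2)^{-1}\right)$.
This can be verified by simple normal regression theory, recognizing that the joint
distribution of $(\bar X, \theta)$ is the bivariate normal, with zero expectation and covariance
matrix

$$V = \begin{pmatrix} \frac{1}{n} + \tau^2 & \tau^2 \\ \tau^2 & \tau^2 \end{pmatrix}.$$

Thus, the posterior risk is the posterior probability of the event $\{|\hat\theta - \theta| \ge \delta\}$. This is
given by

$$R(\hat\theta, \tau^2) = 1 - \Phi\left(\frac{\hat\theta + \delta - \bar X\left(1 + \frac{1}{n\tau^2}\right)^{-1}}{\left(n + \frac{1}{\tau^2}\right)^{-1/2}}\right) + \Phi\left(\frac{\hat\theta - \delta - \bar X\left(1 + \frac{1}{n\tau^2}\right)^{-1}}{\left(n + \frac{1}{\tau^2}\right)^{-1/2}}\right).$$

We can show then (Zacks, 1971; p. 265) that the Bayes estimator of $\theta$ is the posterior
expectation, i.e.,

$$\hat\theta(\bar X) = \bar X\left(1 + \frac{1}{n\tau^2}\right)^{-1}.$$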


In this example, the minimization of the posterior variance and the maximization of 
the posterior probability of covering 6 by the interval (9 — 5, 6 + 4) is the same. This 
is due to the normal prior and posterior distributions. 
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(b) Continuing with the same model, suppose that we wish to estimate the tail 
probability 


Ww = Po{X > Fo} = 1— Oo — 0) = OG — 5). 


Since the posterior distribution of 6 — &) given X is normal, the Bayes estimator for 
a squared-error loss is the posterior expectation 


1\7) 
X(1+—]}) -& 
Exy{®@ — &)| X}=@ 2 \ie 
oe 
a a) 


Note that this Bayes estimator is strongly consistent since, by the SLLN, X > 6 
almost surely (a.s.), and ®(-) is a continuous function. Hence, the Bayes estimator 
converges to ®(@ — &) a.s. as n > ov. It is interesting to compare this estimator 
to the minimum variance unbiased estimator (MVUE) and to the MLE of the tail 
probability. All these estimators are very close in large samples. 

If the loss function is the absolute deviation, |v — y|, rather than the squared-error, 
(yw — w), then the Bayes estimator of w is the median of the posterior distribution 
of ®(6 — 9). Since the ®-function is strictly increasing this median is ®(65 — &9), 
where 60.5 is the median of the posterior distribution of 6 given X. We thus obtain 
that the Bayes estimator for absolute deviation loss is 


ue ox (1+) -a). 


This is different from the posterior expectation. a 
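The two tail-probability estimators of part (b) can be compared numerically. The sketch below uses only the Python standard library; the function names and the input values are illustrative assumptions, and the Monte Carlo routine is merely a sanity check of the closed-form posterior expectation, not part of the example.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.


def posterior_params(xbar, n, tau2):
    """Posterior mean and variance of theta given the sample mean."""
    mean = xbar / (1.0 + 1.0 / (n * tau2))
    var = 1.0 / (n + 1.0 / tau2)
    return mean, var


def tail_prob_squared_error(xbar, n, tau2, xi0):
    """Posterior expectation of Phi(theta - xi0)."""
    m, v = posterior_params(xbar, n, tau2)
    return PHI((m - xi0) / (1.0 + v) ** 0.5)


def tail_prob_absolute_error(xbar, n, tau2, xi0):
    """Posterior median of Phi(theta - xi0)."""
    m, _ = posterior_params(xbar, n, tau2)
    return PHI(m - xi0)


def monte_carlo_check(xbar, n, tau2, xi0, draws=100_000, seed=1):
    """Average Phi(theta - xi0) over draws of theta from the posterior."""
    rng = random.Random(seed)
    m, v = posterior_params(xbar, n, tau2)
    sd = v ** 0.5
    return sum(PHI(rng.gauss(m, sd) - xi0) for _ in range(draws)) / draws


est_se = tail_prob_squared_error(xbar=0.8, n=5, tau2=2.0, xi0=1.0)
est_ae = tail_prob_absolute_error(xbar=0.8, n=5, tau2=2.0, xi0=1.0)
print(est_se, est_ae)
```

The two estimates differ because the denominator $(1 + \tau^2/(1+n\tau^2))^{1/2}$ exceeds one; the difference vanishes as $n$ grows.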


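Example 8.17. In this example, we derive Bayes estimators for the parameters $\mu$ and $\sigma^2$ in the normal model $N(\mu, \sigma^2)$ for squared-error loss. We assume that $X_1, \ldots, X_n$, given $(\mu, \sigma^2)$, are i.i.d. $N(\mu, \sigma^2)$. The minimal sufficient statistic is $(\bar X_n, Q)$, where $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ and $Q = \sum_{i=1}^n (X_i - \bar X_n)^2$. Let $\phi = 1/\sigma^2$ be the precision parameter, and consider the reparametrization $(\mu, \sigma^2) \to (\mu, \phi)$.

The likelihood function is

$$L(\mu, \phi) = \phi^{n/2} \exp\left\{-\frac{\phi}{2}\left[n(\mu - \bar X_n)^2 + Q\right]\right\}.$$

The following is a commonly assumed joint prior distribution for $(\mu, \phi)$, namely,

$$\mu \mid \phi \sim N(\mu_0, \tau^2/\phi),$$

and

$$\phi \sim G\left(\frac{\psi}{2}, \frac{n_0}{2}\right), \quad n_0 > 2,$$

where $n_0$ is an integer, and $0 < \psi < \infty$. This joint prior distribution is called the Normal-Gamma prior, and denoted by $NG\left(\mu_0, \tau^2, \frac{n_0}{2}, \frac{\psi}{2}\right)$. Since $\bar X_n$ and $Q$ are conditionally independent, given $(\mu, \phi)$, and since the distribution of $Q$ is independent of $\mu$, the posterior distribution of $\mu \mid \bar X_n, \phi$ is normal with mean

$$\hat\mu_B = \frac{1}{1 + \tau^2 n}\mu_0 + \frac{\tau^2 n}{1 + \tau^2 n}\bar X_n$$

and variance

$$V\{\mu \mid \phi, \bar X_n\} = \frac{\tau^2}{\phi(1 + n\tau^2)}.$$

$\hat\mu_B$ is a Bayesian estimator of $\mu$ for the squared-error loss function. The posterior risk of $\hat\mu_B$ is

$$V\{\mu \mid \bar X_n, Q\} = E\{V\{\mu \mid \bar X_n, \phi\} \mid \bar X, Q\} + V\{E\{\mu \mid \bar X_n, \phi\} \mid \bar X, Q\}.$$

The second term on the RHS is zero. Thus, the posterior risk of $\hat\mu_B$ is

$$V\{\mu \mid \bar X, Q\} = \frac{\tau^2}{1 + n\tau^2} E\left\{\frac{1}{\phi} \,\Big|\, \bar X, Q\right\}.$$

The posterior distribution of $\phi$ depends only on $Q$. Indeed, if we denote generally by $p(\bar X \mid \mu, \phi)$ and $p(Q \mid \phi)$ the conditional p.d.f. of $\bar X$ and $Q$ then

$$h(\mu, \phi \mid \bar X, Q) \propto p(\bar X \mid \mu, \phi)\, p(Q \mid \phi)\, h(\mu \mid \phi)\, g(\phi) \propto h(\mu \mid \bar X, \phi)\, p(Q \mid \phi)\, g(\phi).$$

Hence, the marginal posterior p.d.f. of $\phi$ is

$$h^*(\phi \mid \bar X, Q) = \int_{-\infty}^{\infty} h(\mu, \phi \mid \bar X, Q)\, d\mu \propto p(Q \mid \phi)\, g(\phi).$$

Thus, from our model, the posterior distribution of $\phi$, given $Q$, is the gamma $G\left(\frac{Q + \psi}{2}, \frac{n + n_0 - 1}{2}\right)$. It follows that

$$E\left\{\frac{1}{\phi} \,\Big|\, \bar X, Q\right\} = E\left\{\frac{1}{\phi} \,\Big|\, Q\right\} = \frac{\psi + Q}{n + n_0 - 3}.$$

Thus, the posterior risk of $\hat\mu_B$ is

$$V\{\mu \mid \bar X, Q\} = \frac{\tau^2}{1 + n\tau^2} \cdot \frac{Q + \psi}{n + n_0 - 3}.$$

The Bayesian estimator of $\sigma^2$ is

$$\hat\sigma^2_B = E\left\{\frac{1}{\phi} \,\Big|\, Q\right\} = \frac{Q + \psi}{n + n_0 - 3}.$$

The posterior risk of $\hat\sigma^2_B$ is

$$V\left\{\frac{1}{\phi} \,\Big|\, Q\right\} = \frac{(Q + \psi)^2}{(n + n_0 - 3)(n + n_0 - 5)} - \frac{(Q + \psi)^2}{(n + n_0 - 3)^2} = \frac{2(Q + \psi)^2}{(n + n_0 - 3)^2 (n + n_0 - 5)},$$

which is finite if $n + n_0 > 5$.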
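The closing formulas of Example 8.17 are the first two moments of $1/\phi$ when $\phi$ has a gamma distribution. The sketch below evaluates them for the posterior $G\left(\frac{Q+\psi}{2}, \frac{n+n_0-1}{2}\right)$ stated in the example, reading $G(\cdot,\cdot)$ as (rate, shape); the function names and numerical inputs are hypothetical.

```python
def inverse_gamma_moments(rate, shape):
    """E{1/phi} and V{1/phi} for phi ~ G(rate, shape); requires shape > 2."""
    mean = rate / (shape - 1.0)
    var = rate ** 2 / ((shape - 1.0) ** 2 * (shape - 2.0))
    return mean, var


def normal_gamma_risks(q, n, n0, psi, tau2):
    """Bayes estimator of sigma^2, its posterior risk, and the posterior risk of mu_B."""
    rate = (q + psi) / 2.0
    shape = (n + n0 - 1) / 2.0
    sigma2_hat, risk_sigma2 = inverse_gamma_moments(rate, shape)
    risk_mu = tau2 / (1.0 + n * tau2) * sigma2_hat
    return sigma2_hat, risk_sigma2, risk_mu


# Hypothetical inputs: Q = 18, n = 10, n0 = 4, psi = 2, tau^2 = 1.
sigma2_hat, risk_sigma2, risk_mu = normal_gamma_risks(q=18.0, n=10, n0=4, psi=2.0, tau2=1.0)
print(sigma2_hat, risk_sigma2, risk_mu)
```

The condition `shape > 2` in the helper is the same as the requirement $n + n_0 > 5$ for a finite posterior risk.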


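Example 8.18. Consider the model of the previous example, but with priorly independent parameters $\mu$ and $\phi$, i.e., we assume that $h(\mu, \phi) = h(\mu) g(\phi)$, where

$$h(\mu) = \frac{1}{\sqrt{2\pi}\,\tau} \exp\left\{-\frac{1}{2\tau^2}(\mu - \mu_0)^2\right\}$$

and

$$g(\phi) = \frac{(\psi/2)^{n_0/2}}{\Gamma(n_0/2)}\, \phi^{\frac{n_0}{2} - 1} e^{-\phi\psi/2}.$$

If $p(\bar X \mid \mu, \phi)$ and $p(Q \mid \phi)$ are the p.d.f. of $\bar X$ and of $Q$, given $\mu$, $\phi$, respectively, then the joint posterior p.d.f. of $(\mu, \phi)$, given $(\bar X, Q)$, is

$$h(\mu, \phi \mid \bar X, Q) = A(\bar X, Q)\, p(\bar X \mid \mu, \phi)\, p(Q \mid \phi)\, h(\mu)\, g(\phi),$$

where $A(\bar X, Q) > 0$ is a normalizing factor. The marginal posterior p.d.f. of $\mu$, given $(\bar X, Q)$, is

$$h^*(\mu \mid \bar X, Q) = A(\bar X, Q)\, h(\mu) \int_0^\infty p(\bar X \mid \mu, \phi)\, p(Q \mid \phi)\, g(\phi)\, d\phi.$$

It is straightforward to show that the integral on the RHS of (8.4.18) is proportional to $\left[1 + \frac{n}{Q + \psi}(\mu - \bar X)^2\right]^{-\frac{n_0 + n}{2}}$. Thus,

$$h^*(\mu \mid \bar X, Q) = A^*(\bar X, Q) \exp\left\{-\frac{1}{2\tau^2}(\mu - \mu_0)^2\right\} \left[1 + \frac{n}{Q + \psi}(\mu - \bar X)^2\right]^{-\frac{n_0 + n}{2}}.$$

A simple analytic expression for the normalizing factor $A^*(\bar X, Q)$ is not available. One can resort to numerical integration to obtain the Bayesian estimator of $\mu$, namely,

$$\hat\mu_B = A^*(\bar X, Q) \int_{-\infty}^{\infty} \mu \exp\left\{-\frac{1}{2\tau^2}(\mu - \mu_0)^2\right\} \left[1 + \frac{n}{Q + \psi}(\mu - \bar X)^2\right]^{-\frac{n_0 + n}{2}} d\mu.$$

By the Lebesgue Dominated Convergence Theorem

$$\lim_{n \to \infty} \int_{-\infty}^{\infty} \mu \exp\left\{-\frac{1}{2\tau^2}(\mu - \mu_0)^2\right\} \left[1 + \frac{n}{Q + \psi}(\mu - \bar X)^2\right]^{-\frac{n_0 + n}{2}} d\mu = \int_{-\infty}^{\infty} \mu \exp\left\{-\frac{1}{2\tau^2}(\mu - \mu_0)^2\right\} \lim_{n \to \infty} \left[1 + \frac{n}{Q + \psi}(\mu - \bar X)^2\right]^{-\frac{n_0 + n}{2}} d\mu.$$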


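Thus, for large values of $n$,

$$\hat\mu_B \approx \frac{\dfrac{\mu_0}{\tau^2} + \dfrac{\bar X}{\hat\sigma^2}}{\dfrac{1}{\tau^2} + \dfrac{1}{\hat\sigma^2}}, \quad \text{where } \hat\sigma^2 = \frac{Q + \psi}{n}.$$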
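The numerical integration mentioned in Example 8.18 for $\hat\mu_B$ can be sketched with a composite Simpson rule applied to the unnormalized marginal posterior kernel of $\mu$; the normalizing factor cancels in the ratio. The function names, the integration range, and the numerical inputs below are illustrative assumptions.

```python
import math


def posterior_kernel(mu, xbar, q, n, n0, psi, mu0, tau2):
    """Unnormalized marginal posterior p.d.f. of mu under independent priors."""
    prior_part = math.exp(-(mu - mu0) ** 2 / (2.0 * tau2))
    t_part = (1.0 + n * (mu - xbar) ** 2 / (q + psi)) ** (-(n0 + n) / 2.0)
    return prior_part * t_part


def simpson(f, a, b, m=2000):
    """Composite Simpson rule with an even number m of subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0


def bayes_mean_mu(xbar, q, n, n0, psi, mu0, tau2, half_width=10.0):
    """Ratio of two integrals; the normalizing factor A* cancels."""
    lo = min(xbar, mu0) - half_width
    hi = max(xbar, mu0) + half_width

    def kern(mu):
        return posterior_kernel(mu, xbar, q, n, n0, psi, mu0, tau2)

    num = simpson(lambda mu: mu * kern(mu), lo, hi)
    den = simpson(kern, lo, hi)
    return num / den


# Hypothetical data summary: n = 20, sample mean 1.0, Q = 19; prior N(0, 1).
mu_hat = bayes_mean_mu(xbar=1.0, q=19.0, n=20, n0=3, psi=1.0, mu0=0.0, tau2=1.0)
print(mu_hat)
```

The estimate falls between the prior mean and the sample mean and moves toward the sample mean as $n$ increases.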
In a similar manner, we can show that the marginal posterior p.d.f. of $\phi$ is

$$g^*(\phi \mid \bar X, Q) = B^*(\bar X, Q)\, \frac{\phi^{\frac{n_0 + n}{2} - 1}}{(1 + n\tau^2\phi)^{1/2}} \exp\left\{-\frac{\phi}{2}\left[Q + \psi + \frac{n(\bar X - \mu_0)^2}{1 + n\tau^2\phi}\right]\right\},$$

where $B^*(\bar X, Q) > 0$ is a normalizing factor. Note that for large values of $n$, $g^*(\phi \mid \bar X, Q)$ is approximately the p.d.f. of $G\left(\frac{Q + \psi}{2}, \frac{n_0 + n - 1}{2}\right)$.

In Chapter 5, we discussed the least-squares and MVUEs of the parameters in linear models. Here, we consider Bayesian estimators for linear models. Comprehensive Bayesian analysis of various linear models is given in the books of Box and Tiao (1973) and of Zellner (1971). The analysis in Zellner's book (see Chapter III) follows a straightforward methodology of deriving the posterior distribution of the regression coefficients for informative and noninformative priors. Box and Tiao also provide geometrical representation of the posterior distributions (probability contours) and the HPD-regions of the parameters. Moreover, by analyzing the HPD-regions Box and Tiao establish the Bayesian justification to the analysis of variance and simultaneous confidence intervals of arbitrary contrasts (the Scheffé S-method). In Example 8.11, we derived the posterior distribution of the regression coefficients of the linear model $Y = \alpha + \beta x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$ and $(\alpha, \beta)$ have a prior normal distribution. In a similar fashion the posterior distribution of $\beta$ in the multiple regression model $Y = X\beta + \epsilon$ can be obtained by assuming that $\epsilon \sim N(0, V)$ and the prior distribution of $\beta$ is $N(\beta_0, B)$. By applying the multinormal theory, we readily obtain that the posterior distribution of $\beta$, given $Y$, is

$$\beta \mid Y \sim N\left(\beta_0 + BX'(V + XBX')^{-1}(Y - X\beta_0),\; B - BX'(V + XBX')^{-1}XB\right).$$


This result is quite general and can be applied whenever the covariance matrix $V$ is known. Often we encounter in the literature the $NG\left(\beta_0, \tau^2, \frac{n_0}{2}, \frac{\psi}{2}\right)$ prior and the (observations) model

$$Y \mid \beta, \phi \sim N(X\beta, (1/\phi) I)$$

and

$$\beta \mid \beta_0, \tau^2 \sim N\left(\beta_0, \frac{\tau^2}{\phi} I\right), \quad \text{and} \quad \phi \sim G\left(\frac{\psi}{2}, \frac{n_0}{2}\right).$$

This model is more general than the previous one, since presently the covariance matrices $V$ and $B$ are known only up to a factor of proportionality. Otherwise, the models are equivalent. If we replace $V$ by $V^*/\phi$, where $V^*$ is a known positive definite matrix, and $\phi$, $0 < \phi < \infty$, is an unknown precision parameter then, by factoring $V^* = C^* C^{*\prime}$, and letting $Y^* = (C^*)^{-1} Y$, $X^* = (C^*)^{-1} X$ we obtain

$$Y^* \mid \beta, \phi \sim N\left(X^*\beta, \frac{1}{\phi} I\right).$$

Similarly, if $B = DD'$ and $\beta^* = D^{-1}\beta$ then

$$\beta^* \mid \phi \sim N\left(\beta_0^*, \frac{\tau^2}{\phi} I\right).$$

If $X^{**} = X^* D$ then the previous model, in terms of $Y^*$ and $X^{**}$, is reduced to

$$Y^* = X^{**}\beta^* + \epsilon^*,$$

where $Y^* = C^{-1}Y$, $X^{**} = C^{-1}XD$, $\beta^* = D^{-1}\beta$, $V = CC'$, and $B = DD'$.

We obtained a linear model generalization of the results of Example 8.17. Indeed,

$$\beta \mid Y, \phi \sim N\left(\beta_0 + \tau^2 X'(I + \tau^2 XX')^{-1}(Y - X\beta_0),\; \frac{\tau^2}{\phi}\left(I - \tau^2 X'(I + \tau^2 XX')^{-1}X\right)\right).$$

Thus, the Bayesian estimator of $\beta$, for the squared-error loss $\|\hat\beta - \beta\|^2$, is

$$\hat\beta_B = \left(I - \tau^2 X'(I + \tau^2 XX')^{-1}X\right)\beta_0 + \tau^2 X'(I + \tau^2 XX')^{-1}Y.$$

As in Example 8.17, the conditional predictive distribution of $Y$, given $\phi$, is normal,

$$Y \mid \phi \sim N\left(X\beta_0, \frac{1}{\phi}(I + \tau^2 XX')\right).$$

Hence, the marginal posterior distribution of $\phi$, given $Y$, is the gamma distribution, i.e.,

$$\phi \mid Y \sim G\left(\frac{1}{2}\left(\psi + (Y - X\beta_0)'(I + \tau^2 XX')^{-1}(Y - X\beta_0)\right), \frac{n + n_0}{2}\right),$$

where $n$ is the dimension of $Y$. Thus, the Bayesian estimator of $\phi$ is

$$\hat\phi_B = \frac{n + n_0}{\psi + (Y - X\beta_0)'(I + \tau^2 XX')^{-1}(Y - X\beta_0)}.$$

Finally, if $\psi = n_0$ then the predictive distribution of $Y$ is the multivariate $t[n_0; X\beta_0, I + \tau^2 XX']$, defined in (2.13.12).
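The estimators $\hat\beta_B$ and $\hat\phi_B$ above can be evaluated directly. The NumPy sketch below is illustrative only: the function names and the simulated inputs are assumptions, and the comparison with the ridge-type expression $(X'X + I/\tau^2)^{-1}(X'Y + \beta_0/\tau^2)$ relies on the standard matrix-inversion identity rather than on anything stated in the text.

```python
import numpy as np


def posterior_beta(y, x, beta0, tau2):
    """Posterior mean of beta and its covariance up to the factor 1/phi."""
    n, p = x.shape
    m = np.eye(n) + tau2 * x @ x.T              # I + tau^2 X X'
    gain = tau2 * x.T @ np.linalg.inv(m)        # tau^2 X'(I + tau^2 X X')^{-1}
    mean = beta0 + gain @ (y - x @ beta0)
    cov_scaled = tau2 * (np.eye(p) - gain @ x)  # multiply by 1/phi for the covariance
    return mean, cov_scaled


def phi_estimator(y, x, beta0, tau2, n0, psi):
    """Posterior mean (n + n0) / (psi + quadratic form) of the precision phi."""
    n = x.shape[0]
    r = y - x @ beta0
    quad = r @ np.linalg.solve(np.eye(n) + tau2 * x @ x.T, r)
    return (n + n0) / (psi + quad)


# Hypothetical inputs: 8 observations, 3 regression coefficients.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))
y = rng.standard_normal(8)
beta0 = np.array([0.5, -1.0, 2.0])
mean, cov_scaled = posterior_beta(y, x, beta0, tau2=0.7)
print(mean, phi_estimator(y, x, beta0, tau2=0.7, n0=4, psi=2.0))
```

The form with $(I + \tau^2 XX')^{-1}$ inverts an $n \times n$ matrix, whereas the ridge-type form inverts a $p \times p$ matrix; which is cheaper depends on whether $n$ or $p$ is larger.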


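Example 8.19. The following is a random growth model. We follow the model assumptions of Section 8.4.3:

$$X_t = \theta_{0,t} + \theta_{1,t}\, t + \epsilon_t, \quad t = 1, 2, \ldots,$$

where $\theta_{0,t}$ and $\theta_{1,t}$ vary at random according to a random-walk model, i.e.,

$$\begin{pmatrix} \theta_{0,t} \\ \theta_{1,t} \end{pmatrix} = \begin{pmatrix} \theta_{0,t-1} \\ \theta_{1,t-1} \end{pmatrix} + \begin{pmatrix} \omega_{0,t} \\ \omega_{1,t} \end{pmatrix}.$$

Thus, let $\theta_t = (\theta_{0,t}, \theta_{1,t})'$ and $a_t' = (1, t)$. The dynamic linear model is thus

$$X_t = a_t'\theta_t + \epsilon_t,$$
$$\theta_t = \theta_{t-1} + \omega_t, \quad t = 1, 2, \ldots.$$

Let $\eta_t$ and $C_t$ be the posterior mean and posterior covariance matrix of $\theta_t$. We obtain the recursive equations

$$\eta_t = \eta_{t-1} + \frac{1}{r_t}(X_t - a_t'\eta_{t-1})(C_{t-1} + \Omega)a_t,$$

$$r_t = \sigma^2 + a_t'(C_{t-1} + \Omega)a_t,$$

$$C_t = C_{t-1} + \Omega - \frac{1}{r_t}(C_{t-1} + \Omega)a_t a_t'(C_{t-1} + \Omega),$$

where $\sigma^2 = V\{\epsilon_t\}$ and $\Omega$ is the covariance matrix of $\omega_t$.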
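The recursive equations of Example 8.19 translate into a short filtering loop. The sketch below is a minimal illustration; the function name, the diffuse starting covariance, and the noise-free test series are assumptions made only for the demonstration.

```python
import numpy as np


def growth_filter(xs, eta0, c0, sigma2, omega):
    """Run the recursions for eta_t and C_t over the observations X_1, X_2, ..."""
    eta = np.asarray(eta0, dtype=float)
    c = np.asarray(c0, dtype=float)
    path = []
    for t, x_t in enumerate(xs, start=1):
        a = np.array([1.0, float(t)])        # a_t' = (1, t)
        p = c + omega                         # C_{t-1} + Omega
        r = sigma2 + a @ p @ a                # r_t
        eta = eta + (x_t - a @ eta) / r * (p @ a)
        c = p - np.outer(p @ a, a @ p) / r
        path.append(eta.copy())
    return np.array(path), c


# Hypothetical series on an exact line 2 + 0.5 t, with no drift in the parameters.
xs = [2.0 + 0.5 * t for t in range(1, 11)]
path, c_final = growth_filter(xs, eta0=[0.0, 0.0], c0=1e4 * np.eye(2),
                              sigma2=0.01, omega=np.zeros((2, 2)))
print(path[-1])
```

With $\Omega = 0$ the recursion reduces to sequential Bayesian regression on $(1, t)$, which is why the filtered mean recovers the intercept and slope of the test series.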


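Example 8.20. In order to illustrate the approximations of Section 8.5, we apply them here to a model in which the posterior p.d.f. can be computed exactly. Thus, let $X_1, \ldots, X_n$ be conditionally i.i.d. random variables, having a common Poisson distribution $P(\lambda)$, $0 < \lambda < \infty$. Let the prior distribution of $\lambda$ be that of a gamma, $G(\Lambda, \alpha)$. Thus, the posterior distribution of $\lambda$, given $T_n = \sum_{i=1}^n X_i$, is like that of $G(n + \Lambda, \alpha + T_n)$, with p.d.f.

$$h_n(\lambda \mid T_n) = \frac{(n + \Lambda)^{\alpha + T_n}}{\Gamma(\alpha + T_n)}\, \lambda^{\alpha + T_n - 1} e^{-\lambda(n + \Lambda)}, \quad 0 < \lambda < \infty.$$

The MLE of $\lambda$ is $\hat\lambda_n = T_n/n = \bar X_n$. In this model, $J(\hat\lambda_n) = n/\bar X_n$. Thus, formula (8.5.11) yields the normal approximation

$$h_n^*(\lambda \mid T_n) = \frac{\sqrt{n}}{\sqrt{2\pi}\, \bar X_n^{1/2}} \exp\left\{-\frac{n}{2\bar X_n}(\lambda - \bar X_n)^2\right\}.$$

From large sample theory, we know that a random variable distributed like $G(n + \Lambda, \alpha + T_n)$, standardized by its mean $(\alpha + T_n)/(n + \Lambda)$ and its standard deviation $(\alpha + T_n)^{1/2}/(n + \Lambda)$, is asymptotically $N(0, 1)$. Thus, the approximation to $h(\lambda \mid T_n)$ given by

$$\hat h_n(\lambda \mid T_n) = \frac{\sqrt{n}\left(1 + \frac{\Lambda}{n}\right)}{\sqrt{2\pi}\left(\bar X_n + \frac{\alpha}{n}\right)^{1/2}} \exp\left\{-\frac{n}{2} \cdot \frac{\left(1 + \frac{\Lambda}{n}\right)^2}{\bar X_n + \frac{\alpha}{n}}\left(\lambda - \frac{\bar X_n + \frac{\alpha}{n}}{1 + \frac{\Lambda}{n}}\right)^2\right\}$$

should be better than $h_n^*(\lambda \mid T_n)$, if the sample size is not very large.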
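The claim that the second approximation is better for small samples can be checked numerically by comparing both normal densities with the exact gamma posterior on a grid. The sketch below uses only the standard library; the function names, the grid, and the parameter values (a small sample with a fairly informative prior) are hypothetical.

```python
import math


def exact_posterior_pdf(lam, t_n, n, big_lambda, alpha):
    """Gamma posterior G(n + Lambda, alpha + T_n), read as (rate, shape)."""
    rate, shape = n + big_lambda, alpha + t_n
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1.0) * math.log(lam) - rate * lam)
    return math.exp(log_pdf)


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mle_based_approx(lam, t_n, n):
    """Normal approximation centered at the MLE with variance Xbar / n."""
    xbar = t_n / n
    return normal_pdf(lam, xbar, xbar / n)


def moment_based_approx(lam, t_n, n, big_lambda, alpha):
    """Normal approximation using the posterior mean and variance."""
    xbar = t_n / n
    mean = (xbar + alpha / n) / (1.0 + big_lambda / n)
    var = (xbar + alpha / n) / (n * (1.0 + big_lambda / n) ** 2)
    return normal_pdf(lam, mean, var)


def max_abs_error(approx, exact, grid):
    return max(abs(approx(lam) - exact(lam)) for lam in grid)


# Hypothetical case: n = 5, T_n = 5, prior G(Lambda = 5, alpha = 10).
n, t_n, big_lambda, alpha = 5, 5, 5.0, 10.0
grid = [0.01 * k for k in range(1, 501)]
exact = lambda lam: exact_posterior_pdf(lam, t_n, n, big_lambda, alpha)
err_mle = max_abs_error(lambda lam: mle_based_approx(lam, t_n, n), exact, grid)
err_moment = max_abs_error(
    lambda lam: moment_based_approx(lam, t_n, n, big_lambda, alpha), exact, grid)
print(err_mle, err_moment)
```

In this setting the MLE-based density ignores the prior entirely, so its center is far from the posterior mass; the moment-based density differs from the exact posterior only through the skewness of the gamma distribution.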


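Example 8.21. We consider again the model of Example 8.20. In that model, for $0 < \lambda < \infty$,

$$k(\lambda) = \lambda - \bar X_n \log\lambda + \lambda\frac{\Lambda}{n} - \frac{\alpha - 1}{n}\log\lambda = \lambda\left(1 + \frac{\Lambda}{n}\right) - \left(\bar X_n + \frac{\alpha - 1}{n}\right)\log\lambda.$$

Thus,

$$k'(\lambda) = \left(1 + \frac{\Lambda}{n}\right) - \frac{\bar X_n + \frac{\alpha - 1}{n}}{\lambda},$$

and the maximizer of $-nk(\lambda)$ is

$$\tilde\lambda_n = \frac{\bar X_n + \frac{\alpha - 1}{n}}{1 + \frac{\Lambda}{n}}.$$

Moreover,

$$\tilde J(\tilde\lambda_n) = \frac{\left(1 + \frac{\Lambda}{n}\right)^2}{\bar X_n + \frac{\alpha - 1}{n}}.$$

The normal approximation, based on $\tilde\lambda_n$ and $\tilde J(\tilde\lambda_n)$, is

$$N\left(\frac{\bar X_n + \frac{\alpha - 1}{n}}{1 + \frac{\Lambda}{n}},\; \frac{1}{n} \cdot \frac{\bar X_n + \frac{\alpha - 1}{n}}{\left(1 + \frac{\Lambda}{n}\right)^2}\right).$$

This is very close to the large sample approximation (8.5.15). The only difference is that $\alpha$, in $\hat h_n(\lambda \mid T_n)$, is replaced by $\alpha' = \alpha - 1$.

Example 8.22. We consider again the model of Example 8.3. In that example,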
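$Y_i \mid \lambda_i \sim P(\lambda_i)$, where $\lambda_i = e^{x_i'\beta}$, $i = 1, \ldots, n$. Let $(X) = (x_1, \ldots, x_n)'$ be the $n \times p$ matrix of covariates. The unknown parameter is $\beta \in \mathbb{R}^{(p)}$. The prior distribution of $\beta$ is normal, i.e., $\beta \sim N(\beta_0, \tau^2 I)$. The likelihood function is, according to Equation (8.1.8),

$$L(\beta; Y_n, (X)) = \exp\left\{w_n'\beta - \sum_{i=1}^n e^{x_i'\beta}\right\},$$

where $w_n = \sum_{i=1}^n Y_i x_i$. The prior p.d.f. is

$$h(\beta) = \frac{1}{(2\pi)^{p/2}\tau^p} \exp\left\{-\frac{1}{2\tau^2}(\beta - \beta_0)'(\beta - \beta_0)\right\}.$$

Accordingly,

$$k(\beta) = \frac{1}{n}\left(-w_n'\beta + \sum_{i=1}^n e^{x_i'\beta} + \frac{1}{2\tau^2}(\beta - \beta_0)'(\beta - \beta_0)\right).$$

Hence,

$$\nabla_\beta k(\beta) = -\frac{1}{n}w_n + \frac{1}{n}\sum_{i=1}^n e^{x_i'\beta}x_i + \frac{1}{n\tau^2}(\beta - \beta_0),$$

and

$$\tilde J(\beta) = \frac{1}{n}\sum_{i=1}^n e^{x_i'\beta}x_i x_i' + \frac{1}{n\tau^2}I.$$

The value $\tilde\beta_n$ is the root of the equation

$$\beta = \beta_0 + \tau^2 w_n - \tau^2\sum_{i=1}^n e^{x_i'\beta}x_i.$$

Note that

$$\tilde J(\beta) = \frac{1}{n\tau^2}\left(I + \tau^2 (X)'\Delta(\beta)(X)\right),$$

where $\Delta(\beta)$ is an $n \times n$ diagonal matrix with $i$th diagonal element equal to $e^{\beta'x_i}$ ($i = 1, \ldots, n$). The matrix $\tilde J(\beta)$ is positive definite for all $\beta \in \mathbb{R}^{(p)}$. We can determine $\tilde\beta_n$ by solving the equation iteratively, starting with the LSE, $\beta_n^* = [(X)'(X)]^{-1}(X)'Y_n^*$, where $Y_n^*$ is a vector whose $i$th component is

$$Y_i^* = \begin{cases} \log(Y_i), & \text{if } Y_i > 0, \\ -\exp\{x_i'\beta_0\}, & \text{if } Y_i = 0, \end{cases} \quad i = 1, \ldots, n.$$

The approximating p.d.f. for $h(\beta \mid (X), Y_n)$ is the p.d.f. of the $p$-variate normal $N\left(\tilde\beta_n, \frac{1}{n}(\tilde J(\tilde\beta_n))^{-1}\right)$. This p.d.f. will be compared later numerically with a p.d.f. obtained by numerical integration and one obtained by simulation.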
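One way to locate the root $\tilde\beta_n$ numerically is a damped Newton iteration on $k(\beta)$, using the gradient and the matrix $\tilde J(\beta)$ given above. This is a sketch under stated assumptions, not the iteration described in the example: it starts from the prior mean rather than from the LSE, it adds step halving for safety, and the function names and simulated data are hypothetical.

```python
import numpy as np


def k_value(beta, y, x, beta0, tau2):
    eta = x @ beta
    penalty = (beta - beta0) @ (beta - beta0) / (2.0 * tau2)
    return (-(y @ eta) + np.exp(eta).sum() + penalty) / len(y)


def grad_k(beta, y, x, beta0, tau2):
    lam = np.exp(x @ beta)
    return (-(x.T @ y) + x.T @ lam + (beta - beta0) / tau2) / len(y)


def hess_k(beta, x, tau2):
    lam = np.exp(x @ beta)
    return (x.T @ (lam[:, None] * x) + np.eye(x.shape[1]) / tau2) / x.shape[0]


def posterior_mode(y, x, beta0, tau2, iters=100, tol=1e-10):
    """Damped Newton iteration; returns the mode and (1/n) J(mode)^{-1}."""
    beta = np.array(beta0, dtype=float)      # assumption: start at the prior mean
    for _ in range(iters):
        step = np.linalg.solve(hess_k(beta, x, tau2), grad_k(beta, y, x, beta0, tau2))
        t, k_old = 1.0, k_value(beta, y, x, beta0, tau2)
        # Halve the step while it fails to decrease k (with a small slack).
        while t > 1e-8 and k_value(beta - t * step, y, x, beta0, tau2) > k_old + 1e-12:
            t /= 2.0
        beta = beta - t * step
        if np.max(np.abs(t * step)) < tol:
            break
    cov = np.linalg.inv(hess_k(beta, x, tau2)) / x.shape[0]
    return beta, cov


# Hypothetical data: intercept plus one covariate, 60 Poisson counts.
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(60), rng.uniform(-1.0, 1.0, 60)])
y = rng.poisson(np.exp(x @ np.array([0.5, 0.3])))
beta0 = np.zeros(2)
mode, cov = posterior_mode(y, x, beta0, tau2=1.0)
print(mode)
```

Because $\tilde J(\beta)$ is positive definite everywhere, $k(\beta)$ is strictly convex and the Newton direction is always a descent direction.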


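Example 8.23. Let $(X_i, Y_i)$, $i = 1, \ldots, n$ be i.i.d. random vectors, having a standard bivariate normal distribution, i.e., $(X, Y)' \sim N\left(0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)$. The likelihood function of $\rho$ is

$$L(\rho \mid T_n) = \frac{1}{(1 - \rho^2)^{n/2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[Q_X - 2\rho P_{XY} + Q_Y\right]\right\},$$

where $T_n = (Q_X, P_{XY}, Q_Y)$ and $Q_X = \sum_{i=1}^n X_i^2$, $Q_Y = \sum_{i=1}^n Y_i^2$, $P_{XY} = \sum_{i=1}^n X_i Y_i$. The Fisher information function for $\rho$ is

$$I_n(\rho) = n\,\frac{1 + \rho^2}{(1 - \rho^2)^2}.$$

Using the Jeffreys prior

$$h(\rho) = \frac{(1 + \rho^2)^{1/2}}{1 - \rho^2}, \quad -1 < \rho < 1,$$

the Bayesian estimator of $\rho$ for the squared error loss is

$$\hat\rho_B = \frac{\displaystyle\int_{-1}^{1} \rho\, \frac{(1 + \rho^2)^{1/2}}{(1 - \rho^2)^{\frac{n}{2} + 1}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[Q_X - 2\rho P_{XY} + Q_Y\right]\right\} d\rho}{\displaystyle\int_{-1}^{1} \frac{(1 + \rho^2)^{1/2}}{(1 - \rho^2)^{\frac{n}{2} + 1}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[Q_X - 2\rho P_{XY} + Q_Y\right]\right\} d\rho}.$$

This estimator, for given values of $Q_X$, $Q_Y$ and $P_{XY}$, can be evaluated accurately by 16-point Gaussian quadrature. For $n = 16$ points, we get from Table 25.4 of Abramowitz and Stegun (1968) the values

u_i      | 0.0950  0.2816  0.4580  0.6179  0.7554  0.8656  0.9446  0.9894
omega_i  | 0.1895  0.1826  0.1692  0.1496  0.1246  0.0952  0.0623  0.0272

For negative values of $u$, we use $-u_i$ with the same weight, $\omega_i$. For a sample of size $n = 10$, with $Q_X = 6.1448$, $Q_Y = 16.1983$, and $P_{XY} = 4.5496$, we obtain $\hat\rho_B = 0.3349$.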
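The quadrature described in Example 8.23 can be sketched with NumPy's Gauss-Legendre nodes and weights, which agree with the tabulated abscissas and weights to the digits shown. The function names are illustrative; the kernel is rescaled by its maximum on the log scale, an implementation choice that cancels in the ratio.

```python
import numpy as np


def log_posterior_kernel(rho, n, qx, qy, pxy):
    """Log of Jeffreys prior times likelihood, up to an additive constant."""
    one_m = 1.0 - rho ** 2
    return (0.5 * np.log1p(rho ** 2) - (n / 2.0 + 1.0) * np.log(one_m)
            - (qx - 2.0 * rho * pxy + qy) / (2.0 * one_m))


def rho_bayes_quadrature(n, qx, qy, pxy, points=16):
    """Ratio of two Gauss-Legendre sums over (-1, 1)."""
    u, w = np.polynomial.legendre.leggauss(points)
    log_k = log_posterior_kernel(u, n, qx, qy, pxy)
    k = np.exp(log_k - log_k.max())      # common factor cancels in the ratio
    return float(np.sum(w * u * k) / np.sum(w * k))


# Values quoted in Example 8.23.
rho_16 = rho_bayes_quadrature(n=10, qx=6.1448, qy=16.1983, pxy=4.5496)
rho_64 = rho_bayes_quadrature(n=10, qx=6.1448, qy=16.1983, pxy=4.5496, points=64)
print(rho_16, rho_64)
```

Running the same routine with more nodes gives a simple check on the adequacy of 16 points, and the result can be compared with the value 0.3349 reported in the example.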


Example 8.24. In this example, we consider evaluating the integrals in $\hat\rho_B$ of Example 8.23 by simulations. We simulate 100 random variables $U_i \sim R(-1, 1)$, $i = 1, \ldots, 100$, and approximate the integrals in the numerator and denominator of $\hat\rho_B$ by averages. For $n = 100$, and the same values of $Q_X$, $Q_Y$, $P_{XY}$, as in Example 8.23, we obtain the approximation $\hat\rho_B = 0.36615$.
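The simulation approach of Example 8.24 replaces each integral by an average over uniform draws on $(-1, 1)$. The sketch below uses the standard library only; the function names and seed are hypothetical, and it assumes the sample size $n = 10$ of Example 8.23 for the likelihood, with `draws` denoting the number of simulated values.

```python
import math
import random


def posterior_kernel(rho, n, qx, qy, pxy):
    """Jeffreys prior times likelihood, up to a constant factor."""
    one_m = 1.0 - rho * rho
    return (math.sqrt(1.0 + rho * rho) * one_m ** (-(n / 2.0 + 1.0))
            * math.exp(-(qx - 2.0 * rho * pxy + qy) / (2.0 * one_m)))


def rho_bayes_monte_carlo(n, qx, qy, pxy, draws=100, seed=0):
    """Ratio of two sample averages over U_i ~ R(-1, 1)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(draws):
        u = rng.uniform(-1.0, 1.0)
        if abs(u) >= 1.0:                 # guard against an endpoint draw
            continue
        k = posterior_kernel(u, n, qx, qy, pxy)
        num += u * k
        den += k
    return num / den


rho_small = rho_bayes_monte_carlo(10, 6.1448, 16.1983, 4.5496, draws=100)
rho_large = rho_bayes_monte_carlo(10, 6.1448, 16.1983, 4.5496, draws=50_000, seed=1)
print(rho_small, rho_large)
```

With only 100 draws the estimate varies noticeably from seed to seed; increasing the number of draws brings it close to the quadrature value of Example 8.23.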


Example 8.25. We return to Example 8.22 and compute the posterior expectation of $\beta$ by simulation. Note that, for a large number $M$ of simulation runs, $E\{\beta \mid X, Y\}$ is approximated by

$$\hat E = \frac{\displaystyle\sum_{j=1}^M \beta_j \exp\left\{w_n'\beta_j - \sum_{i=1}^n e^{x_i'\beta_j}\right\}}{\displaystyle\sum_{j=1}^M \exp\left\{w_n'\beta_j - \sum_{i=1}^n e^{x_i'\beta_j}\right\}},$$

where $\beta_j$ is a random vector, simulated from the $N(\beta_0, \tau^2 I)$ distribution.

To illustrate the result numerically, we consider a case where the observed sample contains 40 independent observations; ten for each one of the four $x$ vectors:

$$x_1' = (\ldots), \quad x_2' = (1, 1, -1, -1), \quad x_3' = (\ldots), \quad x_4' = (\ldots).$$

The observed value of $w_{40}$ is $(6, 3, 11, 29)$. For $\beta_0' = (0.1, 0.1, 0.5, 0.5)$, we obtain the following Bayesian estimators $\hat E$, with $M = 1000$,

           Simulation                        Normal Approximation
beta-hat   tau = 0.01   tau = 0.05     beta-tilde   tau = 0.01   tau = 0.05
beta_1     0.0946       0.0060         beta_1       0.0965       0.0386
beta_2     0.0987       0.0842         beta_2       0.0995       0.0920
beta_3     0.4983       0.4835         beta_3       0.4970       0.4510
beta_4     0.5000       0.5021         beta_4       0.4951       0.4038

We see that when $\tau = 0.01$ the Bayesian estimators are very close to the prior mean $\beta_0$. When $\tau = 0.05$ the Bayesian estimators might be quite different from $\beta_0$. In Example 8.22, we approximated the posterior distribution of $\beta$ by a normal distribution. For the values in this example the normal approximation yields similar results to those of the simulation.
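The simulation estimator $\hat E$ of Example 8.25 is a weighted average of prior draws, with weights proportional to the likelihood. The NumPy sketch below is illustrative only: the function name, the design matrix, and the simulated counts are hypothetical and are not the data of the example.

```python
import numpy as np


def posterior_mean_by_simulation(w, x, beta0, tau, m=1000, seed=0):
    """Average prior draws beta_j ~ N(beta0, tau^2 I) with likelihood weights."""
    rng = np.random.default_rng(seed)
    betas = beta0 + tau * rng.standard_normal((m, len(beta0)))
    log_lik = betas @ w - np.exp(betas @ x.T).sum(axis=1)
    weights = np.exp(log_lik - log_lik.max())   # rescaling cancels in the ratio
    return weights @ betas / weights.sum()


# Hypothetical design: four covariate vectors, ten observations each.
design = np.array([[1.0, 1.0, 0.0, 0.0],
                   [1.0, 0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0, 1.0]])
x = np.repeat(design, 10, axis=0)
y = np.random.default_rng(1).poisson(1.5, size=40)
w = x.T @ y                                      # w_n = sum_i Y_i x_i
beta0 = np.array([0.1, 0.1, 0.5, 0.5])
est_tight = posterior_mean_by_simulation(w, x, beta0, tau=0.01)
est_loose = posterior_mean_by_simulation(w, x, beta0, tau=0.05)
print(est_tight, est_loose)
```

Since the estimate is a convex combination of the prior draws, a small $\tau$ necessarily keeps it near $\beta_0$, which matches the behavior described in the example.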


Example 8.26. Consider the following repetitive problem. In a certain manufacturing process, a lot of $N$ items is produced every day. Let $M_j$, $j = 1, 2, \ldots$, denote the number of defective items in the lot of the $j$th day. The parameters $M_1, M_2, \ldots$ are unknown. At the end of each day, a random sample of size $n$ is selected without replacement from that day's lot for inspection. Let $X_j$ denote the number of defectives observed in the sample of the $j$th day. The distribution of $X_j$ is the hypergeometric $H(N, M_j, n)$, $j = 1, 2, \ldots$. Samples from different days are (conditionally) independent (given $M_1, M_2, \ldots$). In this problem, it is often reasonable to assume that the parameters $M_1, M_2, \ldots$ are independent random variables having the same binomial distribution $B(N, \theta)$. $\theta$ is the probability of defectives in the production process. It is assumed that $\theta$ does not change in time. The value of $\theta$ is, however, unknown. It is simple to verify that for a prior $B(N, \theta)$ distribution of $M$, and a squared-error loss function, the Bayes estimator of $M_j$ is

$$\hat M_j = X_j + (N - n)\theta.$$

The corresponding Bayes risk is

$$\rho(\theta) = (N - n)\theta(1 - \theta).$$

A sequence of empirical Bayes estimators is obtained by substituting in $\hat M_j$ a consistent estimator of $\theta$ based on the results of the first $(j - 1)$ days. Under the above assumption on the prior distribution of $M_1, M_2, \ldots$, the predictive distribution of $X_1, X_2, \ldots$ is the binomial $B(n, \theta)$. A priori, for a given value of $\theta$, $X_1, X_2, \ldots$ can be considered as i.i.d. random variables having the mixed distribution $B(n, \theta)$. Thus, $\hat p_{j-1} = \frac{1}{n(j-1)}\sum_{i=1}^{j-1} X_i$, for $j \ge 2$, is a sequence of consistent estimators of $\theta$. The corresponding sequence of empirical Bayes estimators is

$$\tilde M_j = X_j + (N - n)\hat p_{j-1}, \quad j \ge 2.$$

The posterior risk of $\tilde M_j$ given $(X_j, \hat p_{j-1})$ is

$$\rho_j(\tilde M_j) = E\{[M_j - \tilde M_j]^2 \mid \hat p_{j-1}, X_j\} = E\{[M_j - \hat M_j]^2 \mid \hat p_{j-1}, X_j\} + (\hat M_j - \tilde M_j)^2 = (N - n)\theta(1 - \theta) + (\hat M_j - \tilde M_j)^2.$$

We consider now the conditional expectation of $\rho_j(\tilde M_j)$ given $X_j$. This is given by

$$E\{\rho_j(\tilde M_j) \mid X_j\} = (N - n)\theta(1 - \theta) + (N - n)^2 E\{[\hat p_{j-1} - \theta]^2\} = (N - n)\theta(1 - \theta)\left[1 + \frac{N - n}{n(j - 1)}\right].$$

Notice that this converges as $j \to \infty$ to $\rho(\theta)$.
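The empirical Bayes sequence and its expected posterior risk can be computed directly from the formulas of Example 8.26. The sketch below is a minimal illustration; the function names and the daily counts are hypothetical.

```python
def empirical_bayes_estimates(xs, lot_size, sample_size):
    """Return the estimates for days j = 2, ..., J using p_{j-1} from earlier days."""
    estimates, running_total = [], 0
    for j, x_j in enumerate(xs, start=1):
        if j >= 2:
            p_prev = running_total / (sample_size * (j - 1))
            estimates.append(x_j + (lot_size - sample_size) * p_prev)
        running_total += x_j
    return estimates


def bayes_risk(lot_size, sample_size, theta):
    """rho(theta) = (N - n) theta (1 - theta)."""
    return (lot_size - sample_size) * theta * (1.0 - theta)


def expected_posterior_risk(j, lot_size, sample_size, theta):
    """Expected posterior risk of the empirical Bayes estimator on day j >= 2."""
    inflation = 1.0 + (lot_size - sample_size) / (sample_size * (j - 1))
    return bayes_risk(lot_size, sample_size, theta) * inflation


# Hypothetical counts of defectives in daily samples of n = 10 from lots of N = 100.
xs = [2, 1, 3]
estimates = empirical_bayes_estimates(xs, lot_size=100, sample_size=10)
risks = [expected_posterior_risk(j, 100, 10, theta=0.1) for j in (2, 10, 100)]
print(estimates, risks)
```

The inflation factor $1 + (N - n)/(n(j - 1))$ shows how quickly the cost of not knowing $\theta$ disappears as days accumulate.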


Example 8.27. This example shows the application of empirical Bayes techniques to the simultaneous estimation of many probability vectors. The problem was motivated by a problem of assessing the readiness probabilities of military units based on exercises of big units. For details, see Brier, Zacks, and Marlow (1986).

A large number, $N$, of units are tested independently on tasks that are classified into $K$ categories. Each unit obtains on each task the value 1 if it is executed satisfactorily and the value 0 otherwise. Let $i$, $i = 1, \ldots, N$, be the index of the $i$th unit, and $j$, $j = 1, \ldots, K$, the index of a category. Unit $i$ was tested on $M_{ij}$ tasks of category $j$. Let $X_{ij}$ denote the number of tasks in category $j$ on which the $i$th unit received a satisfactory score. Let $\theta_{ij}$ denote the probability of the $i$th unit executing satisfactorily a task of category $j$. There are $N$ parametric vectors $\theta_i = (\theta_{i1}, \ldots, \theta_{iK})'$, $i = 1, \ldots, N$, to be estimated.

The model is that, conditional on $\theta_i$, $X_{i1}, \ldots, X_{iK}$ are independent random variables, having binomial distributions, i.e.,

$$X_{ij} \mid \theta_{ij} \sim B(M_{ij}, \theta_{ij}).$$

In addition, the vectors $\theta_i$ ($i = 1, \ldots, N$) are i.i.d. random variables, having a common distribution. Since $M_{ij}$ were generally large, but not the same, we have used first the variance stabilizing transformation

$$Y_{ij} = 2\sin^{-1}\left(\sqrt{\frac{X_{ij} + 3/8}{M_{ij} + 3/4}}\right), \quad i = 1, \ldots, N, \quad j = 1, \ldots, K.$$

For large values of $M_{ij}$ the asymptotic distribution of $Y_{ij}$ is $N(\eta_{ij}, 1/M_{ij})$, where $\eta_{ij} = 2\sin^{-1}(\sqrt{\theta_{ij}})$, as shown in Section 7.6.

Let $Y_i = (Y_{i1}, \ldots, Y_{iK})'$, $i = 1, \ldots, N$. The parametric empirical Bayes model is that $(Y_i, \theta_i)$ are i.i.d., $i = 1, \ldots, N$,

$$Y_i \mid \theta_i \sim N(\eta_i, D_i), \quad i = 1, \ldots, N$$

and

$$\eta_1, \ldots, \eta_N \text{ are i.i.d. } N(\mu, \Sigma),$$

where $\eta_i = (\eta_{i1}, \ldots, \eta_{iK})'$ and

$$D_i = \mathrm{diag}\left(\frac{1}{M_{i1}}, \ldots, \frac{1}{M_{iK}}\right), \quad i = 1, \ldots, N.$$

The prior parameters $\mu$ and $\Sigma$ are unknown. Note that if $\mu$ and $\Sigma$ are given then $\eta_1, \eta_2, \ldots, \eta_N$ are a posteriori independent, given $Y_1, \ldots, Y_N$. Furthermore, the posterior distribution of $\eta_i$, given $Y_i$, is

$$\eta_i \mid Y_i, \mu, \Sigma \sim N(Y_i - B_i(Y_i - \mu), (I - B_i)D_i),$$

where

$$B_i = D_i(D_i + \Sigma)^{-1}, \quad i = 1, \ldots, N.$$

Thus, if $\mu$ and $\Sigma$ are given, the Bayesian estimator of $\eta_i$, for a squared-error loss function $L(\eta_i, \hat\eta_i) = \|\hat\eta_i - \eta_i\|^2$, is the posterior mean, i.e.,

$$\hat\eta_i(\mu, \Sigma) = B_i\mu + (I - B_i)Y_i, \quad i = 1, \ldots, N.$$

The empirical Bayes method estimates $\mu$ and $\Sigma$ from all the data. We derive now MLEs of $\mu$ and $\Sigma$. These MLEs are then substituted in $\hat\eta_i(\mu, \Sigma)$ to yield empirical Bayes estimators of $\eta_i$.


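Note that $Y_i \mid \mu, \Sigma \sim N(\mu, \Sigma + D_i)$, $i = 1, \ldots, N$. Hence, the log-likelihood function of $(\mu, \Sigma)$, given the data $(Y_1, \ldots, Y_N)$, is

$$l(\mu, \Sigma) = -\frac{1}{2}\sum_{i=1}^N \log|\Sigma + D_i| - \frac{1}{2}\sum_{i=1}^N (Y_i - \mu)'(\Sigma + D_i)^{-1}(Y_i - \mu).$$

The vector $\hat\mu(\Sigma)$, which maximizes $l(\mu, \Sigma)$, for a given $\Sigma$ is

$$\hat\mu(\Sigma) = \left(\sum_{i=1}^N (\Sigma + D_i)^{-1}\right)^{-1}\sum_{i=1}^N (\Sigma + D_i)^{-1}Y_i.$$

Substituting $\hat\mu(\Sigma)$ in $l(\mu, \Sigma)$ and finding $\Sigma$ that maximizes the expression can yield the MLE $(\hat\mu, \hat\Sigma)$. Another approach, to find the MLE, is given by the E-M algorithm. The E-M algorithm considers the unknown parameters $\eta_1, \ldots, \eta_N$ as missing data. The algorithm is an iterative process, having two phases in each iteration. The first phase is the E-phase, in which the conditional expectation of the likelihood function is determined, given the data and the current values of $(\mu, \Sigma)$. In the next phase, the M-phase, the conditionally expected likelihood is maximized by determining the maximizing arguments $(\hat\mu, \hat\Sigma)$. More specifically, let

$$l^*(\mu, \Sigma \mid \eta_1, \ldots, \eta_N, Y_1, \ldots, Y_N) = -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (\eta_i - \mu)'\Sigma^{-1}(\eta_i - \mu)$$

be the log-likelihood of $(\mu, \Sigma)$ if $\eta_1, \ldots, \eta_N$ were known. Let $(\mu^{(p)}, \Sigma^{(p)})$ be the estimator of $(\mu, \Sigma)$ after $p$ iterations, $p \ge 0$, where $\mu^{(0)}$, $\Sigma^{(0)}$ are initial estimates.

In the $(p + 1)$st iteration, we start (the E-phase) by determining

$$l^{**}(\mu, \Sigma \mid Y_1, \ldots, Y_N, \mu^{(p)}, \Sigma^{(p)}) = E\{l^*(\mu, \Sigma \mid \eta_1, \ldots, \eta_N, Y_1, \ldots, Y_N) \mid Y_1, \ldots, Y_N, \mu^{(p)}, \Sigma^{(p)}\}$$

$$= -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N E\{(\eta_i - \mu)'\Sigma^{-1}(\eta_i - \mu) \mid Y_i, \mu^{(p)}, \Sigma^{(p)}\},$$

where the conditional expectation is determined as though $\mu^{(p)}$ and $\Sigma^{(p)}$ are the true values. It is well known that if $E\{X\} = \xi$ and the covariance matrix of $X$ is $C(X)$ then $E\{X'AX\} = \xi'A\xi + \mathrm{tr}\{AC(X)\}$, where $\mathrm{tr}\{\cdot\}$ is the trace of the matrix (see Seber, 1977, p. 13). Thus,

$$E\{(\eta_i - \mu)'\Sigma^{-1}(\eta_i - \mu) \mid Y_i, \mu^{(p)}, \Sigma^{(p)}\} = (W_i^{(p)} - \mu)'\Sigma^{-1}(W_i^{(p)} - \mu) + \mathrm{tr}\{\Sigma^{-1}V_i^{(p)}\},$$

where $W_i^{(p)} = \hat\eta_i(\mu^{(p)}, \Sigma^{(p)})$, in which $B_i^{(p)} = D_i(D_i + \Sigma^{(p)})^{-1}$, and $V_i^{(p)} = B_i^{(p)}\Sigma^{(p)}$, $i = 1, \ldots, N$. Thus,

$$l^{**}(\mu, \Sigma \mid Y_1, \ldots, Y_N, \mu^{(p)}, \Sigma^{(p)}) = -\frac{N}{2}\log|\Sigma| - \frac{N}{2}\mathrm{tr}\{\Sigma^{-1}\bar V^{(p)}\} - \frac{1}{2}\sum_{i=1}^N (W_i^{(p)} - \mu)'\Sigma^{-1}(W_i^{(p)} - \mu),$$

where $\bar V^{(p)} = \frac{1}{N}\sum_{i=1}^N V_i^{(p)}$. In phase-M, we determine $\mu^{(p+1)}$ and $\Sigma^{(p+1)}$ by maximizing $l^{**}(\mu, \Sigma \mid \cdots)$.

One can immediately verify that

$$\mu^{(p+1)} = \bar W^{(p)} = \frac{1}{N}\sum_{i=1}^N W_i^{(p)}.$$

Moreover,

$$l^{**}(\mu^{(p+1)}, \Sigma \mid Y_1, \ldots, Y_N, \mu^{(p)}, \Sigma^{(p)}) = -\frac{N}{2}\log|\Sigma| - \frac{N}{2}\mathrm{tr}\{\Sigma^{-1}(C^{(p)} + \bar V^{(p)})\},$$

where $C^{(p)} = (C_{jj'}^{(p)};\ j, j' = 1, \ldots, K)$, and

$$C_{jj'}^{(p)} = \frac{1}{N}\sum_{i=1}^N (W_{ij}^{(p)} - \bar W_j^{(p)})(W_{ij'}^{(p)} - \bar W_{j'}^{(p)}).$$

Finally, the matrix maximizing $l^{**}$ is

$$\Sigma^{(p+1)} = C^{(p)} + \bar V^{(p)}.$$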

We can prove recursively, by induction on $p$, that

$\ldots$

where $\bar B^{(p)} = \frac{1}{N}\sum_{i=1}^N B_i^{(p)}$, $p = 0, 1, \ldots$, and

$$K = \frac{1}{N}\sum_{i=1}^N (B_i Y_i - \bar B\,\bar Y).$$

Thus,

$$\lim_{p \to \infty} \bar W^{(p)} = \bar Y - (I - \bar B)^{-1}K = \left(\sum_{i=1}^N (I - B_i)\right)^{-1}\sum_{i=1}^N (I - B_i)Y_i.$$

One continues iterating until $\mu^{(p)}$ and $\Sigma^{(p)}$ do not change significantly.

Brier, Zacks, and Marlow (1986) studied the efficiency of these empirical Bayes estimators, in comparison to the simple MLE, and to another type of estimator that will be discussed in Chapter 9.

Section 8.1


8.1.1 Let $\mathcal{F} = \{B(N, \theta); 0 < \theta < 1\}$ be the family of binomial distributions.
(i) Show that the family of beta prior distributions $\mathcal{H} = \{\beta(p, q); 0 < p, q < \infty\}$ is conjugate to $\mathcal{F}$.
(ii) What is the posterior distribution of $\theta$ given a sample of $n$ i.i.d. random variables, having a distribution in $\mathcal{F}$?
(iii) What is the predictive distribution of $X_{n+1}$, given $(X_1, \ldots, X_n)$?

8.1.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a Pareto distribution, with p.d.f.

$$f(x; \nu) = \nu A^\nu / x^{\nu + 1}, \quad 0 < A < x < \infty;$$

$0 < \nu < \infty$ ($A$ is a specified positive constant).
(i) Show that the geometric mean, $G = \left(\prod_{i=1}^n X_i\right)^{1/n}$, is a minimal sufficient statistic.
(ii) Suppose that $\nu$ has a prior $G(\lambda, p)$ distribution. What is the posterior distribution of $\nu$ given $X$?


(iii) What are the posterior expectation and posterior variance of $\nu$ given $X$?

8.1.3 Let $X$ be a $p$-dimensional vector having a multinormal distribution $N(\mu, \Sigma)$. Suppose that $\Sigma$ is known and that $\mu$ has a prior normal distribution $N(\mu_0, V)$. What is the posterior distribution of $\mu$ given $X$?

8.1.4 Apply the results of Problem 3 to determine the posterior distribution of $\beta$ in the normal multiple regression model of full rank, when $\sigma^2$ is known. More specifically, let

$$X \sim N(A\beta, \sigma^2 I) \quad \text{and} \quad \beta \sim N(\beta_0, V).$$

(i) Find the posterior distribution of $\beta$ given $X$.
(ii) What is the predictive distribution of $X_2$ given $X_1$, assuming that conditionally on $\beta$, $X_1$ and $X_2$ are i.i.d.?

8.1.5 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a Poisson distribution, $P(\lambda)$, $0 < \lambda < \infty$. Compute the posterior probability $P\{\lambda \ge \bar X_n \mid \bar X_n\}$ corresponding to the Jeffreys prior.

8.1.6 Suppose that $X_1, \ldots, X_n$ are i.i.d. random variables having a $N(0, \sigma^2)$ distribution, $0 < \sigma^2 < \infty$.
(i) Show that if $1/2\sigma^2 \sim G\left(\frac{1}{\tau}, \nu\right)$ then the posterior distribution of $1/2\sigma^2$ given $S = \sum X_i^2$ is also a gamma distribution.
(ii) What is $E\{\sigma^2 \mid S\}$ and $V\{\sigma^2 \mid S\}$ according to the above Bayesian assumptions?

8.1.7 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $N(\mu, \sigma^2)$ distribution; $-\infty < \mu < \infty$, $0 < \sigma < \infty$. A normal-inverted gamma prior distribution for $\mu$, $\sigma^2$ assumes that the conditional prior distribution of $\mu$, given $\sigma^2$, is $N(\mu_0, \lambda\sigma^2)$ and that the prior distribution of $1/2\sigma^2$ is a $G\left(\frac{1}{\tau}, \nu\right)$. Derive the posterior joint distribution of $(\mu, \sigma^2)$ given $(\bar X, S^2)$, where $\bar X$ and $S^2$ are the sample mean and variance, respectively.

8.1.8 Consider again Problem 7 assuming that $\mu$ and $\sigma^2$ are priorly independent, with $\mu \sim N(\mu_0, D)$ and $1/2\sigma^2 \sim G\left(\frac{1}{\tau}, \nu\right)$. What are the posterior expectations and variances of $\mu$ and of $\sigma^2$, given $(\bar X, S^2)$?

8.1.9 Let $X \sim B(n, \theta)$, $0 < \theta < 1$. Suppose that $\theta$ has a prior beta distribution $\beta\left(\frac{1}{2}, \frac{1}{2}\right)$. Suppose that the loss function associated with estimating $\theta$ by $\hat\theta$ is $L(\hat\theta, \theta)$. Find the risk function and the posterior risk when $L(\hat\theta, \theta)$ is
(i) $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$;
(ii) $L(\hat\theta, \theta) = (\hat\theta - \theta)^2/\theta(1 - \theta)$;
(iii) $L(\hat\theta, \theta) = |\hat\theta - \theta|$.

8.1.10 Consider a decision problem in which $\theta$ can assume the values in the set $\{\theta_0, \theta_1\}$. If i.i.d. random variables $X_1, \ldots, X_n$ are observed, their joint p.d.f. is $f(\mathbf{x}_n; \theta)$ where $\theta$ is either $\theta_0$ or $\theta_1$. The prior probability of $\{\theta = \theta_0\}$ is $\pi$. A statistician takes actions $a_0$ or $a_1$. The loss function associated with this decision problem is given by

$$L(\theta, a_0) = \begin{cases} 0, & \text{if } \theta = \theta_0, \\ K, & \text{if } \theta = \theta_1, \end{cases} \qquad L(\theta, a_1) = \begin{cases} 1, & \text{if } \theta = \theta_0, \\ 0, & \text{if } \theta = \theta_1. \end{cases}$$

(i) What is the prior risk function, if the statistician takes action $a_0$ with probability $\xi$?
(ii) What is the posterior risk function?
(iii) What is the optimal action with no observations?
(iv) What is the optimal action after $n$ observations?

8.1.11 The time till failure of an electronic equipment, $T$, has an exponential distribution, i.e., $T \sim G\left(\frac{1}{\tau}, 1\right)$; $0 < \tau < \infty$. The mean-time till failure, $\tau$, has an inverted gamma prior distribution, $1/\tau \sim G(\Lambda, \nu)$. Given $n$ observations on i.i.d. failure times $T_1, \ldots, T_n$, the action is an estimator $\hat\tau_n = t(T_1, \ldots, T_n)$. The loss function is $L(\hat\tau, \tau) = |\hat\tau_n - \tau|$. Find the posterior risk of $\hat\tau_n$. Which estimator will minimize the posterior risk?


Section 8.2 


8.2.1 Let $X$ be a random variable having a Poisson distribution, $P(\lambda)$. Consider the problem of testing the two simple hypotheses $H_0: \lambda = \lambda_0$ against $H_1: \lambda = \lambda_1$; $0 < \lambda_0 < \lambda_1 < \infty$.
(i) What is the form of the Bayes test $\phi_\pi(X)$, for a prior probability $\pi$ of $H_0$ and costs $c_1$ and $c_2$ for errors of types I and II?
(ii) Show that $R_0(\pi) = c_1(1 - P([\xi(\pi)]; \lambda_0))$, where $P(j; \lambda)$ is the c.d.f. of $P(\lambda)$, $R_1(\pi) = c_2 P([\xi(\pi)]; \lambda_1)$,

$$\xi(\pi) = \left(\log\frac{\pi}{1 - \pi} + \log\frac{c_1}{c_2} + \lambda_1 - \lambda_0\right)\Big/\log\left(\frac{\lambda_1}{\lambda_0}\right),$$

where $[x]$ is the largest integer not exceeding $x$.
(iii) Compute $(R_0(\pi), R_1(\pi))$ for the case of $c_1 = 1$, $c_2 = 3$, $\lambda_1/\lambda_0 = 2$, $\lambda_1 - \lambda_0 = 2$, and graph the lower boundary of the risk set $R$.

8.2.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables having an exponential distribution $G(\lambda, 1)$, $0 < \lambda < \infty$. Consider the two composite hypotheses $H_0: \lambda \le \lambda_0$ against $H_1: \lambda > \lambda_0$. The prior distribution of $\lambda$ is $G\left(\frac{1}{\tau}, \nu\right)$. The loss functions associated with accepting $H_i$ ($i = 0, 1$) are

$$L_0(\lambda) = \begin{cases} 0, & \lambda \le \lambda_0, \\ \ldots \end{cases}$$

and

$$L_1(\lambda) = \ldots, \quad 0 < B < \infty.$$

(i) Determine the form of the Bayes test of $H_0$ against $H_1$.
(ii) What is the Bayes risk?

8.2.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a binomial distribution $B(1, \theta)$. Consider the two composite hypotheses $H_0: \theta \le 1/2$ against $H_1: \theta > 1/2$. The prior distribution of $\theta$ is $\beta(p, q)$. Compute the Bayes Factor in favor of $H_1$ for the cases of $n = 20$, $T = 15$ and
(i) $p = q = 1/2$ (Jeffreys prior);
(ii) $p = 1$, $q = 3$.

8.2.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $N(\mu, \sigma^2)$ distribution. Let $Y_1, \ldots, Y_m$ be i.i.d. random variables having a $N(\eta, \rho\sigma^2)$ distribution, $-\infty < \mu, \eta < \infty$, $0 < \sigma^2, \rho < \infty$. The $X$-sample is independent of the $Y$-sample. Consider the problem of testing the hypothesis $H_0: \rho \le 1$, $(\mu, \eta, \sigma^2)$ arbitrary against $H_1: \rho > 1$, $(\mu, \eta, \sigma^2)$ arbitrary. Determine the form of the Bayes test function for the formal prior p.d.f. $h(\mu, \eta, \sigma, \rho) \propto \frac{1}{\sigma^2} \cdot \frac{1}{\rho}$ and a loss function with $c_1 = c_2 = 1$.

8.2.5 Let $X$ be a $k$-dimensional random vector having a multinomial distribution $M(n; \theta)$. We consider a Bayes test of $H_0: \theta = \frac{1}{k}\mathbf{1}$ against $H_1: \theta \ne \frac{1}{k}\mathbf{1}$. Let $\theta$ have a prior symmetric Dirichlet distribution (8.2.27) with $\nu = 1$ or 2 with equal hyper-prior probabilities.
(i) Compute the Bayes Factor in favor of $H_1$ when $k = 5$, $n = 50$, $X_1 = 7$, $X_2 = 12$, $X_3 = 9$, $X_4 = 15$, and $X_5 = 7$. [Hint: Approximate the values of $\Gamma(\nu k + n)$ by the Stirling approximation: $n! \approx e^{-n} n^n \sqrt{2\pi n}$, for large $n$.]
(ii) Would you reject $H_0$ if $c_1 = c_2 = 1$?

8.2.6 Let $X_1, X_2, \ldots$ be a sequence of i.i.d. normal random variables, $N(0, \sigma^2)$. Consider the problem of testing $H_0: \sigma^2 = 1$ against $H_1: \sigma^2 = 2$ sequentially. Suppose that $c_1 = 1$ and $c_2 = 5$, and the cost of observation is $c = 0.01$.
(i) Determine the functions $\rho^{(n)}(\pi)$ for $n = 1, 2, 3$.
(ii) What would be the Chernoff approximation to the SPRT boundaries $(A, B)$?

8.2.7 Let $X_1, X_2, \ldots$ be a sequence of i.i.d. binomial random variables, $B(1, \theta)$, $0 < \theta < 1$. According to $H_0$: $\theta = 0.3$. According to $H_1$: $\theta = 0.7$. Let $\pi$, $0 < \pi < 1$, be the prior probability of $H_0$. Suppose that the cost for erroneous decision is $b = 10$ [\$] (either type of error) and the cost of observation is $c = 0.1$ [\$]. Derive the Bayes risk functions $\rho^{(i)}(\pi)$, $i = 1, 2, 3$, and the associated decision and stopping rules of the Bayes sequential procedure.


Section 8.3 


8.3.1 Let $X \sim B(n, \theta)$ be a binomial random variable. Determine a $(1 - \alpha)$-level credibility interval for $\theta$ with respect to the Jeffreys prior $h(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}$, for the case of $n = 20$, $X = 17$, and $\alpha = 0.05$.

8.3.2 Consider the normal regression model (Problem 3, Section 2.9). Assume that $\sigma^2$ is known and that $(\alpha, \beta)$ has a prior bivariate normal distribution $N\left(\binom{\alpha_0}{\beta_0}, \Sigma\right)$.
(i) Derive the $(1 - \varepsilon)$ joint credibility region for $(\alpha, \beta)$.
(ii) Derive a $(1 - \varepsilon)$ credibility interval for $\alpha + \beta\xi_0$, when $\xi_0$ is specified.
(iii) What is the $(1 - \varepsilon)$ simultaneous credibility interval for $\alpha + \beta\xi$, for all $\xi$?

8.3.3 Consider Problem 4 of Section 8.2. Determine the $(1 - \alpha)$ upper credibility limit for the variance ratio $\rho$.

8.3.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $N(\mu, \sigma^2)$ distribution and let $Y_1, \ldots, Y_n$ be i.i.d. random variables having a $N(\eta, \sigma^2)$ distribution. The $X$s and the $Y$s are independent. Assume the formal prior for $\mu$, $\eta$, and $\sigma$, i.e.,


h(u, n, 0) «0, —OO <M, Nn<&, 0<0? <0. 


(i) Determine a $(1 - \alpha)$ HPD-interval for $\delta = \mu - \eta$.
(ii) Determine a $(1 - \alpha)$ HPD-interval for $\sigma$.

8.3.5 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $G(\lambda, 1)$ distribution and let $Y_1, \ldots, Y_m$ be i.i.d. random variables having a $G(\eta, 1)$ distribution. The $X$s and $Y$s are independent. Assume that $\lambda$ and $\eta$ are priorly independent having prior distributions $G\left(\frac{1}{\tau_1}, \nu_1\right)$ and $G\left(\frac{1}{\tau_2}, \nu_2\right)$, respectively. Determine a $(1 - \alpha)$ HPD-interval for $\omega = \lambda/\eta$.


Section 8.4

8.4.1 Let $X \sim B(n, \theta)$, $0 < \theta < 1$. Suppose that the prior distribution of $\theta$ is $\beta(p, q)$, $0 < p, q < \infty$.

(i) Derive the Bayes estimator of $\theta$ for the squared-error loss.

(ii) What is the posterior risk and the prior risk of the Bayes estimator of (i)?

(iii) Derive the Bayes estimator of $\theta$ for the quadratic loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2/\theta(1 - \theta)$, and the Jeffreys prior $\left(p = q = \frac{1}{2}\right)$.

(iv) What is the prior and the posterior risk of the estimator of (iii)?

8.4.2 Let $X \sim P(\lambda)$, $0 < \lambda < \infty$. Suppose that the prior distribution of $\lambda$ is $G\left(\frac{1}{\tau}, \nu\right)$.

(i) Derive the Bayes estimator of $\lambda$ for the loss function $L(\hat\lambda, \lambda) = a(\hat\lambda - \lambda)^+ + b(\hat\lambda - \lambda)^-$, where $(\cdot)^+ = \max(\cdot, 0)$ and $(\cdot)^- = -\min(\cdot, 0)$; $0 < a, b < \infty$.

(ii) Derive the Bayes estimator for the loss function $L(\hat\lambda, \lambda) = (\hat\lambda - \lambda)^2/\lambda$.

(iii) What is the limit of the Bayes estimators in (ii) when $\nu \to \frac{1}{2}$ and $\tau \to \infty$?

8.4.3 Let $X_1, \ldots, X_n, Y$ be i.i.d. random variables having a normal distribution $N(\mu, \sigma^2)$; $-\infty < \mu < \infty$, $0 < \sigma^2 < \infty$. Consider the Jeffreys prior with $h(\mu, \sigma^2)\,d\mu\,d\sigma^2 \propto d\mu\,d\sigma^2/\sigma^2$. Derive the $\gamma$-quantile of the predictive distribution of $Y$ given $(X_1, \ldots, X_n)$.

8.4.4 Let $X \sim P(\lambda)$, $0 < \lambda < \infty$. Derive the Bayesian estimator of $\lambda$ with respect to the loss function $L(\hat\lambda, \lambda) = (\hat\lambda - \lambda)^2/\lambda$, and a prior gamma distribution.

8.4.5 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a $B(1, \theta)$ distribution, $0 < \theta < 1$. Derive the Bayesian estimator of $\theta$ with respect to the loss function $L(\hat\theta, \theta) = (\hat\theta - \theta)^2/\theta(1 - \theta)$, and a prior beta distribution.

8.4.6 In continuation of Problem 5, show that the posterior risk of $\hat\theta = \sum X_i/n$ with respect to $L(\hat\theta, \theta) = (\hat\theta - \theta)^2/\theta(1 - \theta)$ is $1/n$ for all $\sum X_i$. This implies that the best sequential sampling procedure for this Bayes procedure is a fixed sample procedure. If the cost of observation is $c$, determine the optimal sample size.
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8.4.7. Consider the normal-gamma linear model NG (6 ne, ) where Y is 


four-dimensional and 


ie, rs | 
a es | 
Oo eal (ee | 
a a Ce | 


(i) What is the predictive distribution of Y? 
(ii) What is the Bayesian estimator bp? 


8.4.8 As an alternative to the hierarchical model of Gelman et al. (1995) described in Section 8.4.2, assume that $n_1, \ldots, n_k$ are large. Make the variance stabilizing transformation


Y, =2sin7! lima at eee 
n; + 3/4 


We consider now the normal model

$$Y \mid \theta \sim N(\eta, D),$$

where $Y = (Y_1, \ldots, Y_k)'$, $\eta = (\eta_1, \ldots, \eta_k)'$ with $\eta_i = 2\sin^{-1}(\sqrt{\theta_i})$, $i = 1, \ldots, k$. Moreover, $D$ is a diagonal matrix, $D = \mathrm{diag}\left\{\frac{1}{n_i}, i = 1, \ldots, k\right\}$. Assume a prior multinormal distribution for $\eta$.

(i) Develop a credibility region for $\eta$, and by inverse (1:1) transformations obtain a credibility region for $\theta = (\theta_1, \ldots, \theta_k)$.


8.4.9 Consider the normal random walk model, which is a special case of the dynamic linear model (8.4.6), given by

$$Y_n = \theta_n + \epsilon_n,$$
$$\theta_n = \theta_{n-1} + \omega_n, \quad n = 1, 2, \ldots,$$

where $\theta_0 \sim N(\eta_0, c_0)$, $\{\epsilon_n\}$ are i.i.d. $N(0, \sigma^2)$ and $\{\omega_n\}$ are i.i.d. $N(0, \tau^2)$. Show that $\lim_{n\to\infty} c_n = c^*$, where $c_n$ is the posterior variance of $\theta_n$, given $(Y_1, \ldots, Y_n)$. Find the formula of $c^*$.
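The convergence asked for in Problem 8.4.9 can be explored numerically. The sketch below assumes the usual one-step updating of the posterior variance for this model, $c_n = \sigma^2(c_{n-1} + \tau^2)/(c_{n-1} + \tau^2 + \sigma^2)$, which is not stated in the problem; the function names and the parameter values are illustrative only.

```python
import math

def posterior_variances(c0, sigma2, tau2, steps):
    """Iterate c_n = sigma2 * (c_{n-1} + tau2) / (c_{n-1} + tau2 + sigma2)."""
    c = c0
    path = []
    for _ in range(steps):
        prior_var = c + tau2  # variance of theta_n before Y_n is observed
        c = sigma2 * prior_var / (prior_var + sigma2)
        path.append(c)
    return path

def fixed_point(sigma2, tau2):
    """Positive root of c**2 + tau2 * c - sigma2 * tau2 = 0 (assumed limit)."""
    return (-tau2 + math.sqrt(tau2 ** 2 + 4.0 * sigma2 * tau2)) / 2.0

# Illustrative values: c0 = 10, sigma^2 = 1, tau^2 = 0.5.
path = posterior_variances(c0=10.0, sigma2=1.0, tau2=0.5, steps=50)
c_star = fixed_point(sigma2=1.0, tau2=0.5)
print(path[:3], path[-1], c_star)
```

Under these assumptions the iterates settle quickly, whatever the starting value $c_0$, which suggests that the limit does not depend on the prior variance.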


Section 8.5 
8.5.1 The integral

$$I = \int_0^1 e^{-\theta}\theta^X(1 - \theta)^{n-X}\,d\theta$$

can be computed analytically or numerically. Analytically, it can be computed as

(i)
$$I = \sum_{j=0}^{\infty} \frac{(-1)^j}{j!} \int_0^1 \theta^{X+j}(1 - \theta)^{n-X}\,d\theta = \sum_{j=0}^{\infty} \frac{(-1)^j}{j!} B(X + j + 1, n - X + 1).$$

(ii) Make the transformation $\omega = \theta/(1 - \theta)$, and write

$$I = \int_0^{\infty} \frac{e^{-\omega/(1+\omega)}}{(1 + \omega)^2} \exp\{-n(\log(1 + \omega) - p_n \log \omega)\}\,d\omega,$$

where $p_n = X/n$. Let $k(\omega) = \log(1 + \omega) - p_n\log(\omega)$ and $f(\omega) = \exp\left\{-\frac{\omega}{1+\omega}\right\}\Big/(1 + \omega)^2$. Find $\hat\omega$ which maximizes $-k(\omega)$. Use (8.5.3) and (8.5.4) to approximate $I$.

(iii) Approximate $I$ numerically. Compare (i), (ii) and (iii) for $n = 20, 50$ and $X = 15, 37$. How good is the saddle-point approximation (8.5.9)?
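As a starting point for parts (i) and (iii), the following sketch evaluates $I$ by the alternating series and by a composite Simpson rule. The function names, the number of series terms, and the number of Simpson panels are choices made here for illustration and are not taken from the text.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def integral_series(x, n, terms=60):
    """Sum of (-1)^j / j! * B(x + j + 1, n - x + 1) over j = 0, ..., terms - 1."""
    total = 0.0
    for j in range(terms):
        total += (-1) ** j / math.factorial(j) * math.exp(log_beta(x + j + 1, n - x + 1))
    return total

def integral_simpson(x, n, panels=2000):
    """Composite Simpson rule on [0, 1]; panels must be even."""
    def integrand(t):
        return math.exp(-t) * t ** x * (1.0 - t) ** (n - x)
    h = 1.0 / panels
    total = integrand(0.0) + integrand(1.0)
    for i in range(1, panels):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0

for n, x in ((20, 15), (50, 37)):
    print(n, x, integral_series(x, n), integral_simpson(x, n))
```

The two columns printed for each pair $(n, X)$ can then be set against the approximation obtained in part (ii).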


8.5.2 Prove that if $U_1$, $U_2$ are two i.i.d. rectangular $(0, 1)$ random variables then the Box-Muller transformation (8.5.26) yields two i.i.d. $N(0, 1)$ random variables.
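An empirical check does not replace the proof, but it can make the claim concrete. Formula (8.5.26) is not reproduced in this part of the text, so the sketch below assumes the standard form of the transformation, $Z_1 = \sqrt{-2\log U_1}\cos(2\pi U_2)$ and $Z_2 = \sqrt{-2\log U_1}\sin(2\pi U_2)$; the seed and the sample size are arbitrary.

```python
import math
import random

def box_muller(u1, u2):
    """Map two uniforms on (0, 1) to a pair of deviates (assumed standard form)."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def mean(values):
    return sum(values) / len(values)

rng = random.Random(12345)
# 1 - rng.random() lies in (0, 1], so log(u1) is always defined.
pairs = [box_muller(1.0 - rng.random(), rng.random()) for _ in range(20000)]
z1 = [p[0] for p in pairs]
z2 = [p[1] for p in pairs]

m1, m2 = mean(z1), mean(z2)
v1 = mean([(z - m1) ** 2 for z in z1])
v2 = mean([(z - m2) ** 2 for z in z2])
cov = mean([(a - m1) * (b - m2) for a, b in pairs])
print(m1, m2, v1, v2, cov)
```

Sample means near 0, sample variances near 1, and a sample covariance near 0 are consistent with the statement to be proved, although zero correlation alone would not establish independence.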


8.5.3 Consider the integral $I$ of Problem 1. How would you approximate $I$ by simulation? How would you run the simulations so that, with probability $\geq 0.95$, the absolute error is not greater than 1% of $I$?


Section 8.6 


8.6.1 Let $(X_1, \theta_1), \ldots, (X_n, \theta_n), \ldots$ be a sequence of independent random vectors of which only the Xs are observable. Assume that the conditional distributions of $X_i$ given $\theta_i$ are $B(1, \theta_i)$, $i = 1, 2, \ldots$, and that $\theta_1, \theta_2, \ldots$ are i.i.d. having some prior distribution $H(\theta)$ on $(0, 1)$.
(i) Construct an empirical-Bayes estimator of $\theta$ for the squared-error loss.
(ii) Construct an empirical-Bayes estimator of $\theta$ for the squared-error loss, if it is assumed that $H(\theta)$ belongs to the family $\mathcal{H} = \{\beta(p, q) : 0 < p, q < \infty\}$.


8.6.2 Let $(X_1, \psi_1), \ldots, (X_n, \psi_n), \ldots$ be a sequence of independent random vectors of which only the Xs are observable. It is assumed that the conditional distribution of $X_i$ given $\psi_i$ is $NB(\psi_i, \nu)$, $\nu$ known, $i = 1, 2, \ldots$. Moreover, it is assumed that $\psi_1, \psi_2, \ldots$ are i.i.d. having a prior distribution $H(\psi)$ belonging to the family $\mathcal{H}$ of beta distributions. Construct a sequence of empirical-Bayes estimators for the squared-error loss, and show that their posterior risks converge a.s. to the posterior risk of the true $\beta(p, q)$.


8.6.3 Let $(X_1, \lambda_1), \ldots, (X_n, \lambda_n), \ldots$ be a sequence of independent random vectors, where $X_i \mid \lambda_i \sim G(\lambda_i, 1)$, $i = 1, 2, \ldots$, and $\lambda_1, \lambda_2, \ldots$ are i.i.d. having a prior $G\left(\frac{1}{\tau}, \nu\right)$ distribution; $\tau$ and $\nu$ unknown. Construct a sequence of empirical-Bayes estimators of $\lambda_i$, for the squared-error loss.
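8.1.6 (i) Since $X \sim N(0, \sigma^2)$, $S = \sum_{i=1}^n X_i^2 \sim \sigma^2\chi^2[n]$, or $S \sim 2\sigma^2 G\left(1, \frac{n}{2}\right)$. Thus, the density of $S$, given $\sigma^2$, is

$$f_S(x \mid \sigma^2) = \frac{1}{\Gamma\left(\frac{n}{2}\right)(2\sigma^2)^{n/2}}\, x^{\frac{n}{2}-1} e^{-x/2\sigma^2}, \quad 0 < x < \infty.$$

Let $\phi = \frac{1}{2\sigma^2}$ and let the prior distribution of $\phi$ be like that of $G\left(\frac{1}{\tau}, \nu\right)$. Hence, the posterior distribution of $\phi$, given $S$, is like that of $G\left(S + \frac{1}{\tau}, \frac{n}{2} + \nu\right)$.

(ii)
$$E\{\sigma^2 \mid S\} = \frac{\left(S + \frac{1}{\tau}\right)^{\frac{n}{2}+\nu}}{2\Gamma\left(\frac{n}{2} + \nu\right)} \int_0^\infty \phi^{\frac{n}{2}+\nu-2} e^{-\phi\left(S + \frac{1}{\tau}\right)}\,d\phi = \frac{\left(S + \frac{1}{\tau}\right)^{\frac{n}{2}+\nu}\Gamma\left(\frac{n}{2} + \nu - 1\right)}{2\Gamma\left(\frac{n}{2} + \nu\right)\left(S + \frac{1}{\tau}\right)^{\frac{n}{2}+\nu-1}} = \frac{S + \frac{1}{\tau}}{n + 2\nu - 2}.$$

Similarly, we find that

$$V\{\sigma^2 \mid S\} = \frac{2\left(S + \frac{1}{\tau}\right)^2}{(n + 2\nu - 2)^2(n + 2\nu - 4)}.$$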
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8.1.9 
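$$X \sim B(n, \theta), \quad 0 < \theta < 1,$$
$$\theta \sim \beta\left(\frac{1}{2}, \frac{1}{2}\right),$$
$$\theta \mid X \sim \beta\left(X + \frac{1}{2}, n - X + \frac{1}{2}\right).$$

Accordingly,

(i)
$$E\{(\hat\theta - \theta)^2 \mid X\} = \hat\theta^2 - 2\hat\theta\,\frac{X + \frac{1}{2}}{n + 1} + \frac{\left(X + \frac{1}{2}\right)\left(X + \frac{3}{2}\right)}{(n + 1)(n + 2)}.$$

(ii)
$$E\left\{\frac{(\hat\theta - \theta)^2}{\theta(1 - \theta)} \,\Big|\, X\right\} = \frac{n(n - 1)\hat\theta^2}{\left(X - \frac{1}{2}\right)\left(n - X - \frac{1}{2}\right)} - \frac{2n\hat\theta}{n - X - \frac{1}{2}} + \frac{X + \frac{1}{2}}{n - X - \frac{1}{2}}.$$

(iii)
$$E\{|\hat\theta - \theta| \mid X\} = 2\hat\theta\, I_{\hat\theta}\left(X + \frac{1}{2}, n - X + \frac{1}{2}\right) - \hat\theta - \frac{X + \frac{1}{2}}{n + 1}\left[2 I_{\hat\theta}\left(X + \frac{3}{2}, n - X + \frac{1}{2}\right) - 1\right],$$

where $I_x(a, b)$ denotes the incomplete beta function ratio.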


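8.2.1 (i) The prior risks are $R_0 = c_1\pi$ and $R_1 = c_2(1 - \pi)$, where $\pi$ is the prior probability of $H_0$. These two risk lines intersect at $\pi^* = c_2/(c_1 + c_2)$. We have $R_0(\pi^*) = R_1(\pi^*)$. The posterior probability that $H_0$ is true is

$$\pi(X) = \frac{\pi e^{-\lambda_0}\lambda_0^X}{\pi e^{-\lambda_0}\lambda_0^X + (1 - \pi)e^{-\lambda_1}\lambda_1^X}.$$

The Bayes test function is

$$\phi_\pi(X) = \begin{cases} 1, & \text{if } \pi(X) < \pi^*, \\ 0, & \text{if } \pi(X) \geq \pi^*. \end{cases}$$

$\phi_\pi(X)$ is the probability of rejecting $H_0$. Note that $\phi_\pi(X) = 1$ if, and only if, $X > \xi(\pi)$, where

$$\xi(\pi) = \frac{\lambda_1 - \lambda_0 + \log\frac{\pi}{1 - \pi} + \log\frac{c_1}{c_2}}{\log\left(\frac{\lambda_1}{\lambda_0}\right)}.$$

(ii)
$$R_0(\pi) = c_1 E\{\phi_\pi(X)\} = c_1 P\{X > \xi(\pi) \mid \lambda = \lambda_0\}.$$

Let $P(j; \lambda)$ denote the c.d.f. of the Poisson distribution with mean $\lambda$. Then

$$R_0(\pi) = c_1[1 - P([\xi(\pi)]; \lambda_0)],$$

and

$$R_1(\pi) = c_2 P([\xi(\pi)]; \lambda_1).$$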
oes 2+ilog = = log) 
(iii) E(1) = ; 
log(2) 
Then 
R 1— P {| 1.300427 108 a ied A=2 
w)=1- ; + sA= , 
o(7) 0.693147 
R 3P | | 1.300427 108 a ie A=4 
107) = * 9.693147 |? =") 
8.3.2 We have a simple linear regression model $Y_i = \alpha + \beta x_i + \epsilon_i$, $i = 1, \ldots, n$, where $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$. Assume that $\sigma^2$ is known. Let $(X) = (1_n, x_n)$, where $1_n$ is an $n$-dimensional vector of 1s and $x_n' = (x_1, \ldots, x_n)$. The model is

$$Y = (X)\binom{\alpha}{\beta} + \epsilon.$$

(i) Let $\theta = \binom{\alpha}{\beta}$ and $\theta_0 = \binom{\alpha_0}{\beta_0}$. The prior distribution of $\theta$ is $N(\theta_0, \Sigma)$. Note that the covariance matrix of $Y$ is $V[Y] = \sigma^2 I + (X)\Sigma(X)'$. Thus, the posterior distribution of $\theta$, given $Y$, is $N(\eta(Y), D)$, where

$$\eta(Y) = \theta_0 + \Sigma(X)'(\sigma^2 I + (X)\Sigma(X)')^{-1}(Y - (X)\theta_0),$$

and

$$D = \Sigma - \Sigma(X)'(\sigma^2 I + (X)\Sigma(X)')^{-1}(X)\Sigma.$$

Accordingly

$$(\theta - \eta(Y))'D^{-1}(\theta - \eta(Y)) \sim \chi^2[2].$$

Thus, the $(1 - \alpha)$ credibility region for $\theta$ is

$$\{\theta : (\theta - \eta(Y))'D^{-1}(\theta - \eta(Y)) \leq \chi^2_{1-\alpha}[2]\}.$$

(ii) Let $w = \binom{1}{\xi}$ and $w_0 = \binom{1}{\xi_0}$. Then $\alpha + \beta\xi_0 = \theta'w_0$. The posterior distribution of $\theta'w_0$, given $Y$, is $N(\eta(Y)'w_0, w_0'Dw_0)$. Hence, a $(1 - \alpha)$ credibility interval for $\theta'w_0$, given $Y$, has the limits $\eta(Y)'w_0 \pm z_{1-\alpha/2}(w_0'Dw_0)^{1/2}$.

(iii) The simultaneous credibility interval for all $\xi$ is given by Scheffé's S-intervals

$$\eta(Y)'w \pm (2\chi^2_{1-\alpha}[2])^{1/2}(w'Dw)^{1/2}.$$


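8.4.1
$$X \sim B(n, \theta), \quad 0 < \theta < 1.$$

The prior of $\theta$ is Beta$(p, q)$, $0 < p, q < \infty$.

(i) The posterior distribution of $\theta$, given $X$, is Beta$(p + X, q + n - X)$. Hence, the Bayes estimator of $\theta$, for squared-error loss, is

$$\hat\theta_B = \frac{X + p}{n + p + q}.$$

(ii) The posterior risk is

$$V\{\theta \mid X\} = \frac{(X + p)(n - X + q)}{(n + p + q)^2(n + p + q + 1)}.$$

The prior risk is

$$E_H\{MSE(\hat\theta_B)\} = \frac{pq}{(p + q)(p + q + 1)(n + p + q)}.$$

(iii) The Bayes estimator is

$$\hat\theta = \min\left(\frac{\left(X - \frac{1}{2}\right)^+}{n - 1}, 1\right).$$

8.4.3 The Jeffreys prior is $h(\mu, \sigma^2)\,d\mu\,d\sigma^2 = \frac{1}{\sigma^2}\,d\mu\,d\sigma^2$. A version of the likelihood function, given the minimal sufficient statistic $(\bar X, Q)$, where $Q = \sum_{i=1}^n (X_i - \bar X)^2$, is

$$L(\mu, \sigma^2 \mid \bar X, Q) = \frac{1}{(\sigma^2)^{n/2}} \exp\left\{-\frac{Q}{2\sigma^2} - \frac{n}{2\sigma^2}(\bar X - \mu)^2\right\}.$$

Thus, the posterior density under the Jeffreys prior is

$$h(\mu, \sigma^2 \mid \bar X, Q) = \frac{(\sigma^2)^{-\left(\frac{n}{2}+1\right)} \exp\left\{-\frac{Q}{2\sigma^2} - \frac{n}{2\sigma^2}(\mu - \bar X)^2\right\}}{\int_0^\infty \int_{-\infty}^{\infty} (\sigma^2)^{-\left(\frac{n}{2}+1\right)} \exp\left\{-\frac{Q}{2\sigma^2} - \frac{n}{2\sigma^2}(\mu - \bar X)^2\right\} d\mu\,d\sigma^2}.$$

Now,

$$\int_{-\infty}^{\infty} \exp\left\{-\frac{n}{2\sigma^2}(\mu - \bar X)^2\right\} d\mu = \frac{\sigma\sqrt{2\pi}}{\sqrt{n}}.$$

Furthermore,

$$\int_0^\infty \frac{1}{(\sigma^2)^{\frac{n+1}{2}}}\, e^{-\frac{Q}{2\sigma^2}}\,d\sigma^2 = \int_0^\infty \phi^{\frac{n-3}{2}} e^{-\phi\frac{Q}{2}}\,d\phi = \frac{\Gamma\left(\frac{n-1}{2}\right) 2^{\frac{n-1}{2}}}{Q^{\frac{n-1}{2}}}.$$

Accordingly,

$$h(\mu, \sigma^2 \mid \bar X, Q) = \frac{\sqrt{n}\, Q^{\frac{n-1}{2}}}{\sqrt{2\pi}\, 2^{\frac{n-1}{2}}\, \Gamma\left(\frac{n-1}{2}\right)} \cdot \frac{1}{(\sigma^2)^{\frac{n}{2}+1}} \exp\left\{-\frac{Q}{2\sigma^2} - \frac{n(\mu - \bar X)^2}{2\sigma^2}\right\}.$$

Thus, the predictive density of $Y$, given $[\bar X, Q]$, is

$$f_H(y \mid \bar X, Q) = \int_0^\infty \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y - \mu)^2}{2\sigma^2}\right\} h(\mu, \sigma^2 \mid \bar X, Q)\,d\mu\,d\sigma^2,$$

or

$$f_H(y \mid \bar X, Q) = \frac{\sqrt{n}}{\sqrt{n + 1}\,\sqrt{Q}\, B\left(\frac{1}{2}, \frac{n-1}{2}\right)} \left(1 + \frac{n}{n + 1} \cdot \frac{(y - \bar X)^2}{Q}\right)^{-\frac{n}{2}}.$$

Recall that the density of $t[n - 1]$ is

$$f(t; n - 1) = \frac{1}{\sqrt{n - 1}\, B\left(\frac{1}{2}, \frac{n-1}{2}\right)} \left(1 + \frac{t^2}{n - 1}\right)^{-\frac{n}{2}}.$$

Thus, the $\gamma$-quantile of the predictive distribution $f_H(y \mid \bar X, Q)$ is

$$\bar X + t_\gamma[n - 1] \sqrt{\frac{Q}{n - 1}\left(1 + \frac{1}{n}\right)}.$$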
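8.5.3
$$I = \int_0^1 e^{-\theta}\theta^X(1 - \theta)^{n-X}\,d\theta = B(X + 1, n - X + 1)E\{e^{-U}\},$$

where $U \sim \beta(X + 1, n - X + 1)$. By simulation, we generate $M$ i.i.d. values $U_1, \ldots, U_M$, and estimate $I$ by

$$\hat I_M = B(X + 1, n - X + 1) \cdot \frac{1}{M}\sum_{i=1}^M e^{-U_i}.$$

For large $M$, $\hat I_M \approx AN\left(I, \frac{D}{M}\right)$, where $D = B^2(X + 1, n - X + 1)V\{e^{-U}\}$. We have to find $M$ large enough so that $P\{|\hat I_M - I| \leq 0.01 I\} \geq 0.95$. According to the asymptotic normal distribution, we determine $M$ so that

$$2\Phi\left(\frac{\sqrt{M}\, 0.01\, E\{e^{-U}\}}{\sqrt{V\{e^{-U}\}}}\right) - 1 \geq 0.95,$$

or

$$M \geq \frac{\chi^2_{.95}[1]}{0.0001}\left(\frac{E\{e^{-2U}\}}{(E\{e^{-U}\})^2} - 1\right).$$

By the delta method, for large $M$,

$$E\{e^{-2U}\} \cong e^{-2\frac{X+1}{n+2}}\left(1 + 2\frac{(X + 1)(n - X + 1)}{(n + 2)^2(n + 3)}\right).$$

Similarly,

$$E\{e^{-U}\} \cong e^{-\frac{X+1}{n+2}}\left(1 + \frac{1}{2} \cdot \frac{(X + 1)(n - X + 1)}{(n + 2)^2(n + 3)}\right).$$

Hence, $M$ should be the smallest integer such that

$$M \geq \frac{\chi^2_{.95}[1]}{0.0001}\left[\frac{1 + 2\frac{(X+1)(n-X+1)}{(n+2)^2(n+3)}}{\left(1 + \frac{1}{2}\frac{(X+1)(n-X+1)}{(n+2)^2(n+3)}\right)^2} - 1\right].$$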
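The sample-size rule above can be tried out directly. The following sketch evaluates the delta-method bound for $M$ and then runs the simulation; the function names, the seed, the value 3.841459 used for $\chi^2_{.95}[1]$, and the series used as a reference value are choices made here for illustration.

```python
import math
import random

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def required_m(x, n, rel_err=0.01, chi2_95=3.841459):
    """Smallest integer M satisfying the delta-method bound of the solution."""
    v = (x + 1) * (n - x + 1) / ((n + 2) ** 2 * (n + 3))  # V{U}, U ~ beta(x+1, n-x+1)
    ratio = (1.0 + 2.0 * v) / (1.0 + 0.5 * v) ** 2 - 1.0
    return math.ceil(chi2_95 / rel_err ** 2 * ratio)

def simulate_integral(x, n, m, seed=2024):
    """Estimate I = B(x+1, n-x+1) * E{exp(-U)} from m simulated beta values."""
    rng = random.Random(seed)
    avg = sum(math.exp(-rng.betavariate(x + 1, n - x + 1)) for _ in range(m)) / m
    return math.exp(log_beta(x + 1, n - x + 1)) * avg

def series_value(x, n, terms=60):
    """Reference value from the alternating series of Problem 8.5.1 (i)."""
    return sum((-1) ** j / math.factorial(j) * math.exp(log_beta(x + j + 1, n - x + 1))
               for j in range(terms))

m = required_m(15, 20)
estimate = simulate_integral(15, 20, m)
reference = series_value(15, 20)
print(m, estimate, reference, abs(estimate - reference) / reference)
```

The last printed number is the realized relative error of one simulation run; repeating the run with different seeds gives a rough impression of how often the 1% target is met.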
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CHAPTER 9 


Advanced Topics in Estimation Theory 


PART I: THEORY 


In the previous chapters, we discussed various classes of estimators, which attain 
certain optimality criteria, like minimum variance unbiased estimators (MVUE), 
asymptotic optimality of maximum likelihood estimators (MLEs), minimum mean- 
squared-error (MSE) equivariant estimators, Bayesian estimators, etc. In this chapter, 
we present additional criteria of optimality derived from the general statistical deci- 
sion theory. We start with the game theoretic criterion of minimaxity and present some 
results on minimax estimators. We then proceed to discuss minimum risk equivariant 
and structural estimators. We discuss the notion of admissibility and present some 
results of Stein on the inadmissibility of some classical estimators. These examples 
lead to the so-called Stein-type and Shrinkage estimators. 


9.1 MINIMAX ESTIMATORS

Given a class $D$ of estimators, the risk function associated with each $d \in D$ is $R(d, \theta)$, $\theta \in \Theta$. The maximal risk associated with $d$ is $R^*(d) = \sup_{\theta\in\Theta} R(d, \theta)$. If in $D$ there is an estimator $d^*$ that minimizes $R^*(d)$ then $d^*$ is called a minimax estimator. That is,

$$\sup_{\theta\in\Theta} R(d^*, \theta) = \inf_{d\in D}\sup_{\theta\in\Theta} R(d, \theta).$$

A minimax estimator may not exist in $D$. We start with some simple results.


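Lemma 9.1.1. Let $\mathcal{F} = \{F(x; \theta), \theta \in \Theta\}$ be a family of distribution functions and $D$ a class of estimators of $\theta$. Suppose that $d^* \in D$ and $d^*$ is a Bayes estimator relative to some prior distribution $H^*(\theta)$ and that the risk function $R(d^*, \theta)$ does not depend on $\theta$. Then $d^*$ is a minimax estimator.

Proof. Since $R(d^*, \theta) = \rho^*$ for all $\theta$ in $\Theta$, and $d^*$ is Bayes against $H^*(\theta)$, we have

$$\rho^* = \int R(d^*, \theta)h^*(\theta)\,d\theta = \inf_{d\in D}\int R(d, \theta)h^*(\theta)\,d\theta \leq \inf_{d\in D}\sup_{\theta\in\Theta} R(d, \theta). \quad (9.1.1)$$

The inequality holds because an average of $R(d, \theta)$ over $\theta$ cannot exceed its supremum. On the other hand, since $\rho^* = R(d^*, \theta)$ for all $\theta$,

$$\rho^* = \sup_{\theta\in\Theta} R(d^*, \theta) \geq \inf_{d\in D}\sup_{\theta\in\Theta} R(d, \theta). \quad (9.1.2)$$

From (9.1.1) and (9.1.2), we obtain that

$$\sup_{\theta\in\Theta} R(d^*, \theta) = \inf_{d\in D}\sup_{\theta\in\Theta} R(d, \theta). \quad (9.1.3)$$

This means that $d^*$ is minimax. QED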
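A classical illustration of Lemma 9.1.1, not taken from this section, is the binomial case: for $X \sim B(n, \theta)$ and squared-error loss, the estimator $d^*(X) = (X + \sqrt{n}/2)/(n + \sqrt{n})$ is Bayes against the $\beta(\sqrt{n}/2, \sqrt{n}/2)$ prior and has constant risk $1/(4(1 + \sqrt{n})^2)$. The sketch below checks the constant risk by exact summation; the function name and the grid of $\theta$ values are illustrative.

```python
import math

def risk(n, theta, a, b):
    """Exact MSE of d(X) = (X + a) / (n + b) when X ~ B(n, theta)."""
    total = 0.0
    for x in range(n + 1):
        prob = math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)
        total += prob * ((x + a) / (n + b) - theta) ** 2
    return total

n = 25
root = math.sqrt(n)
thetas = [0.05, 0.25, 0.5, 0.75, 0.95]
constant_risk = [risk(n, t, root / 2.0, root) for t in thetas]  # candidate d*
mle_risk = [risk(n, t, 0.0, 0.0) for t in thetas]               # X / n
for t, r_star, r_mle in zip(thetas, constant_risk, mle_risk):
    print(t, r_star, r_mle)
```

The risk of $X/n$ is smaller than that of $d^*$ near the ends of the interval but larger near $\theta = 1/2$, so its maximal risk exceeds the constant risk of $d^*$, in agreement with the lemma.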


Lemma 9.1.1 can be generalized by proving that if there exists a sequence of Bayes estimators with prior risks converging to $\rho^*$, where $\rho^*$ is a constant risk of $d^*$, then $d^*$ is minimax. We obtain this result as a corollary of the following lemma.

Lemma 9.1.2. Let $\{H_k; k \geq 1\}$ be a sequence of prior distributions on $\Theta$ and let $\{\hat\theta_k; k \geq 1\}$ be the corresponding sequence of Bayes estimators with prior risks $\rho(\hat\theta_k, H_k)$. If there exists an estimator $d^*$ for which

$$\sup_{\theta\in\Theta} R(d^*, \theta) \leq \limsup_{k\to\infty} \rho(\hat\theta_k, H_k), \quad (9.1.4)$$

then $d^*$ is minimax.

Proof. If $d^*$ is not a minimax estimator, there exists an estimator $\tilde\theta$ such that

$$\sup_{\theta\in\Theta} R(\tilde\theta, \theta) < \sup_{\theta\in\Theta} R(d^*, \theta). \quad (9.1.5)$$

Moreover, for each $k \geq 1$, since $\hat\theta_k$ is Bayes,

$$\rho(\hat\theta_k, H_k) = \int R(\hat\theta_k, \theta)h_k(\theta)\,d\theta \leq \int R(\tilde\theta, \theta)h_k(\theta)\,d\theta \leq \sup_{\theta\in\Theta} R(\tilde\theta, \theta). \quad (9.1.6)$$

But (9.1.5) in conjunction with (9.1.6) contradicts (9.1.4). Hence, $d^*$ is minimax. QED


9.2 MINIMUM RISK EQUIVARIANT, BAYES EQUIVARIANT, AND 
STRUCTURAL ESTIMATORS 


In Section 5.7.1, we discussed the structure of models that admit equivariant estimators with respect to certain groups of transformations. In this section, we return to this subject and investigate minimum risk, Bayes and minimax equivariant estimators. The statistical model under consideration is specified by a sample space $\mathcal{X}$ and a family of distribution functions $\mathcal{F} = \{F(x; \theta); \theta \in \Theta\}$. Let $G$ be a group of transformations that preserves the structure of the model, i.e., $g\mathcal{X} = \mathcal{X}$ for all $g \in G$, and the induced group $\bar G$ of transformations on $\Theta$ has the property that $\bar g\Theta = \Theta$ for all $\bar g \in \bar G$. An equivariant estimator $\hat\theta(X)$ of $\theta$ was defined as one which satisfies the structural property that $\hat\theta(gX) = \bar g\hat\theta(X)$ for all $g \in G$.

In cases of various orbits of $\bar G$ in $\Theta$, we may index the orbits by a parameter, say $\omega(\theta)$. The risk function of an equivariant estimator $\hat\theta(X)$ is then $R(\hat\theta, \omega(\theta))$. Bayes equivariant estimators can be considered. These are equivariant estimators that minimize the prior risk associated with $\hat\theta$, relative to a prior distribution $H(\theta)$. We assume that $\omega(\theta)$ is a function of $\theta$ for which the following prior risk exists, namely,

$$\rho(\hat\theta, H) = \int_\Theta R(\hat\theta, \omega(\theta))\,dH(\theta) = \int_\Omega R(\hat\theta, \omega)\,dK(\omega), \quad (9.2.1)$$

where $K(\omega)$ is the prior distribution of $\omega(\theta)$, induced by $H(\theta)$. Let $U(X)$ be a maximal invariant statistic with respect to $G$. Its distribution depends on $\theta$ only through $\omega(\theta)$. Suppose that $g(u; \omega)$ is the probability density function (p.d.f.) of $U(X)$ under $\omega$. Let $k(\omega \mid U)$ be the posterior p.d.f. of $\omega$ given $U(X)$. The prior risk of $\hat\theta$ can be written then as

$$\rho(\hat\theta, H) = E_{U\mid K}\{E_{\omega\mid U}\{R(\hat\theta, \omega)\}\}, \quad (9.2.2)$$

where $E_{\omega\mid U}\{R(\hat\theta, \omega)\}$ is the posterior risk of $\hat\theta$, given $U(X)$. An equivariant estimator $\hat\theta_K$ is Bayes against $K(\omega)$ if it minimizes $E_{\omega\mid U}\{R(\hat\theta, \omega)\}$.

As discussed earlier, the Bayes equivariant estimators are relevant only if there are different orbits of $\bar G$ in $\Theta$. Another approach to the estimation problem, if there are no minimum risk equivariant estimators, is to derive formally the Bayes estimators with respect to invariant prior measures (like the Jeffreys improper priors). Such an approach to the above problem of estimating variance components was employed by Tiao and Tan (1965) and by Portnoy (1971). We discuss now formal Bayes estimators more carefully.


9.2.1 Formal Bayes Estimators for Invariant Priors 


Formal Bayes estimators with respect to invariant priors are estimators that minimize the expected risk, when the prior distribution used is improper. In this section, we are concerned with invariant prior measures, such as the Jeffreys noninformative prior $h(\theta)\,d\theta \propto |I(\theta)|^{1/2}\,d\theta$. With such improper priors, the minimum risk estimators can often be formally derived as in Section 5.7. The resulting estimators are called formal Bayes estimators. For example, if $\mathcal{F} = \{F(x; \theta); -\infty < \theta < \infty\}$ is a family of location parameter distributions (of the translation type), i.e., the p.d.f.s are $f(x; \theta) = \phi(x - \theta)$, then, for the group $G$ of real translations, the Jeffreys invariant prior is $h(\theta)\,d\theta \propto d\theta$. If the loss function is $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$, the formal Bayes estimator is

$$E\{\theta \mid X\} = \frac{\int_{-\infty}^{\infty} \theta \prod_{i=1}^n \phi(X_i - \theta)\,d\theta}{\int_{-\infty}^{\infty} \prod_{i=1}^n \phi(X_i - \theta)\,d\theta}. \quad (9.2.3)$$

Making the transformation $Y = X_{(1)} - \theta$, where $X_{(1)} \leq \cdots \leq X_{(n)}$, we obtain

$$E\{\theta \mid X\} = X_{(1)} - \frac{\int_{-\infty}^{\infty} y\,\phi(y)\prod_{i=2}^n \phi(X_{(i)} - X_{(1)} + y)\,dy}{\int_{-\infty}^{\infty} \phi(y)\prod_{i=2}^n \phi(X_{(i)} - X_{(1)} + y)\,dy}. \quad (9.2.4)$$

This is the Pitman estimator (5.7.10).
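The ratio of integrals in (9.2.3) can be evaluated numerically for any density $\phi$. The sketch below uses a plain trapezoidal rule on a grid centered at the sample mean; the function names, the integration range, the grid size, and the sample values are illustrative assumptions. For the standard normal density the result should reproduce the sample mean, which serves as a check of the code.

```python
import math

def pitman_location(xs, density, half_width=12.0, panels=4000):
    """Ratio of the two integrals in (9.2.3), by the trapezoidal rule."""
    center = sum(xs) / len(xs)
    lower = center - half_width
    step = 2.0 * half_width / panels
    numerator = 0.0
    denominator = 0.0
    for i in range(panels + 1):
        theta = lower + i * step
        weight = 0.5 if i in (0, panels) else 1.0
        likelihood = 1.0
        for x in xs:
            likelihood *= density(x - theta)
        numerator += weight * theta * likelihood
        denominator += weight * likelihood
    return numerator / denominator

def std_normal(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def laplace(u):
    return 0.5 * math.exp(-abs(u))

sample = [0.3, -1.2, 2.5, 0.9, 1.6]
print(pitman_location(sample, std_normal), sum(sample) / len(sample))
print(pitman_location(sample, laplace))
```

Shifting every observation by the same constant shifts the computed estimate by that constant, which is the equivariance property under the group of real translations.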
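When $\mathcal{F}$ is a family of location and scale parameters, with p.d.f.s

$$f(x; \mu, \sigma) = \frac{1}{\sigma}\phi\left(\frac{x - \mu}{\sigma}\right), \quad (9.2.5)$$

we consider the group $G$ of real affine transformations $G = \{[\alpha, \beta]; -\infty < \alpha < \infty, 0 < \beta < \infty\}$. The Fisher information matrix of $(\mu, \sigma)$ for $\mathcal{F}$ is, if it exists,

$$I(\mu, \sigma) = \frac{1}{\sigma^2}\begin{pmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{pmatrix}, \quad (9.2.6)$$

where

$$I_{11} = V\left\{\frac{\phi'(u)}{\phi(u)}\right\}, \quad (9.2.7)$$

$$I_{12} = I_{21} = \mathrm{cov}\left(\frac{\phi'(u)}{\phi(u)}, u\frac{\phi'(u)}{\phi(u)}\right), \quad (9.2.8)$$

and

$$I_{22} = V\left\{u\frac{\phi'(u)}{\phi(u)}\right\}. \quad (9.2.9)$$

Accordingly, $|I(\mu, \sigma)| \propto \frac{1}{\sigma^4}$ and the Jeffreys invariant prior is

$$h(\mu, \sigma)\,d\mu\,d\sigma \propto \frac{d\mu\,d\sigma}{\sigma^2}. \quad (9.2.10)$$

If the invariant loss function for estimating $\mu$ is $L(\hat\mu, \mu, \sigma) = \frac{(\hat\mu - \mu)^2}{\sigma^2}$ then the formal Bayes estimator of $\mu$ is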


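$$\hat\mu = \frac{E\left\{\frac{\mu}{\sigma^2} \,\Big|\, X\right\}}{E\left\{\frac{1}{\sigma^2} \,\Big|\, X\right\}} = \frac{\int_{-\infty}^{\infty}\int_0^\infty \mu\,\frac{1}{\sigma^{n+4}}\prod_{i=1}^n \phi\left(\frac{x_i - \mu}{\sigma}\right) d\sigma\,d\mu}{\int_{-\infty}^{\infty}\int_0^\infty \frac{1}{\sigma^{n+4}}\prod_{i=1}^n \phi\left(\frac{x_i - \mu}{\sigma}\right) d\sigma\,d\mu}. \quad (9.2.11)$$

Let $y_1 \leq \cdots \leq y_n$ represent a realization of the order statistics $X_{(1)} \leq \cdots \leq X_{(n)}$. Consider the change of variables

$$u = \frac{y_1 - \mu}{\sigma}, \quad v = \frac{y_2 - y_1}{\sigma}, \quad z_i = \frac{y_i - y_1}{y_2 - y_1}, \quad i \geq 3; \quad (9.2.12)$$

then the formal Bayes estimator of $\mu$ is

$$\hat\mu = y_1 - (y_2 - y_1)\frac{\int_{-\infty}^{\infty} u\,\phi(u)\int_0^\infty v^{n}\phi(u + v)\prod_{i=3}^n \phi(u + vz_i)\,dv\,du}{\int_{-\infty}^{\infty} \phi(u)\int_0^\infty v^{n+1}\phi(u + v)\prod_{i=3}^n \phi(u + vz_i)\,dv\,du}. \quad (9.2.13)$$

For estimating $\sigma$, we consider the invariant loss function $L(\hat\sigma, \sigma) = (\hat\sigma - \sigma)^2/\sigma^2$. The formal Bayes estimator is then

$$\hat\sigma = (y_2 - y_1)\frac{\int_{-\infty}^{\infty} \phi(u)\int_0^\infty v^{n}\phi(u + v)\prod_{i=3}^n \phi(u + vz_i)\,dv\,du}{\int_{-\infty}^{\infty} \phi(u)\int_0^\infty v^{n+1}\phi(u + v)\prod_{i=3}^n \phi(u + vz_i)\,dv\,du}. \quad (9.2.14)$$

Note that the formal Bayes estimator (9.2.13) is equivalent to the Pitman estimator, for a location parameter family (with known scale parameter). The estimator (9.2.14) is the Pitman estimator for a scale parameter family.

Formal Bayes estimation can be used also when the model has parameters that are invariant with respect to the group of transformations $G$. In the variance components model discussed in Example 9.3, the variance ratio $\rho = \tau^2/\sigma^2$ is such an invariant parameter. These parameters are also called nuisance parameters for the transformation model.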


9.2.2 Equivariant Estimators Based on Structural Distributions 


Fraser (1968) introduced structural distributions of parameters in cases of invariance structures, when all the parameters of the model can be transformed by the transformations in $G$. Fraser's approach does not require the assignment of a prior distribution to the unknown parameters. This approach is based on changing the variables of integration from those representing the observable random variables to those representing the parameters. We start the explanation by considering real parameter families. More specifically, let $\mathcal{F} = \{F(x; \theta); \theta \in \Theta\}$ be a family of distributions, where $\Theta$ is an interval on the real line. Let $G$ be a group of one-to-one transformations, preserving the structure of the model. For simplicity of presentation, we assume that the distribution functions of $\mathcal{F}$ are absolutely continuous and that the transformations in $G$ can be represented as functions over $\mathcal{X} \times \Theta$. Choose in $\Theta$ a standard or reference point $e$ and let $U$ be a random variable having the distribution $F(u; e)$, which is the standard distribution. Let $\phi(u)$ be the p.d.f. of the standard distribution. The structural model assumes that if a random variable $X$ has a distribution function $F(x; \theta)$, where $\theta = \bar g e$, $g \in G$, then $X = gU$. Thus, the structural model can be expressed in the formula

$$X = G(U, \theta), \quad \theta \in \Theta. \quad (9.2.15)$$

Assume that $G(u, \theta)$ is differentiable with respect to $u$ and $\theta$. Furthermore, let

$$u = G^{-1}(x; \theta); \quad x \in \mathcal{X}, \quad \theta \in \Theta. \quad (9.2.16)$$

The function $G(u, \theta)$ satisfies the equivariance condition that

$$gx = G(u, \bar g\theta), \quad \text{all } g \in G;$$

with an invariant inverse; i.e.,

$$u = G^{-1}(x, \theta) = G^{-1}(gx, \bar g\theta), \quad \text{all } g \in G.$$

We consider now the variation of $u$ as a function of $\theta$ for a fixed value of $x$. Writing the probability element of $U$ at $u$ in the form

$$\phi(u)\,du = \phi(G^{-1}(x, \theta))\left|\frac{\partial}{\partial\theta}G^{-1}(x, \theta)\right| d\theta, \quad (9.2.17)$$

we obtain for every fixed $x$ a distribution function for $\theta$, over $\Theta$, with p.d.f.

$$k(\theta; x) = \phi(G^{-1}(x, \theta))\,m(\theta, x), \quad (9.2.18)$$

where $m(\theta, x) = \left|\frac{\partial}{\partial\theta}G^{-1}(x, \theta)\right|$. The distribution function corresponding to $k(\theta; x)$ is called the structural distribution of $\theta$ given $X = x$. Let $L(\hat\theta(x), \theta)$ be an invariant loss function. The structural risk of $\hat\theta(x)$ is the expectation

$$R(\hat\theta(x)) = \int L(\hat\theta(x), \theta)\,k(\theta; x)\,d\theta. \quad (9.2.19)$$

An estimator $\hat\theta_0(x)$ is called a minimum risk structural estimator if it minimizes $R(\hat\theta(x))$. The p.d.f. (9.2.18) corresponds to one observation on $X$. Suppose that a sample of independent identically distributed (i.i.d.) random variables $X_1, \ldots, X_n$ is represented by the point $x = (x_1, \ldots, x_n)$. As before, $\theta$ is a real parameter. Let $V(X)$ be a maximal invariant statistic with respect to $G$. The distribution of $V(X)$ is independent of $\theta$. (We assume that $\Theta$ has one orbit of $\bar G$.) Let $k(v)$ be the joint p.d.f. of the maximal invariant $V(X)$. Let $u_1 = G^{-1}(x_1, \theta)$ and let $\phi(u \mid v)$ be the conditional p.d.f. of the standard variable $U = [\theta]^{-1}X$, given $V = v$. The conditional p.d.f. of $\theta$, for a given $x$, is then, as in (9.2.18),

$$k(\theta; x) = \phi(G^{-1}(x_1, \theta) \mid v(x))\,m(\theta, x_1), \quad \theta \in \Theta. \quad (9.2.20)$$

If the model depends on a vector $\theta$ of parameters we make the appropriate generalizations, as will be illustrated in Example 9.4.

We conclude the present section with some comments concerning minimax properties of formal Bayes and structural estimators. Girshick and Savage (1951) proved that if all equivariant estimators in the location parameter model have finite risk, then the Pitman estimator (9.2.4) is minimax. Generally, if a formal Bayes estimator with respect to an invariant prior measure (such as the Jeffreys priors) and an invariant loss function is an equivariant estimator, and if the parameter space $\Theta$ has only one orbit of $\bar G$, then the risk function of the formal Bayes estimator is constant over $\Theta$. Moreover, if this formal Bayes estimator can be obtained as a limit of a sequence of proper Bayes estimators, or if there exists a sequence of proper Bayes estimators and the lower limit of their prior risks is not smaller than the risk of the formal Bayes estimator, then the formal Bayes estimator is minimax. Several theorems are available concerning the minimax nature of the minimum risk equivariant estimators. The most famous is the Hunt-Stein Theorem (Zacks, 1971; p. 346).


9.3 THE ADMISSIBILITY OF ESTIMATORS 


9.3.1 Some Basic Results 


The class of all estimators can be classified according to the given risk function into 
two subclasses: admissible and inadmissible ones. 


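Definition. An estimator $\hat\theta_1(x)$ is called inadmissible with respect to a risk function $R(\hat\theta, \theta)$ if there exists another estimator $\hat\theta_2(x)$ for which

$$\text{(i) } R(\hat\theta_2, \theta) \leq R(\hat\theta_1, \theta), \text{ for all } \theta,$$
$$\text{(ii) } R(\hat\theta_2, \theta') < R(\hat\theta_1, \theta'), \text{ for some } \theta'. \quad (9.3.1)$$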


From the decision theoretic point of view inadmissible estimators are inferior. It is often not an easy matter to prove that a certain estimator is admissible. On the other hand, several examples exist of the inadmissibility of some commonly used estimators. A few examples will be provided later in this section. We start, however, with a simple and important lemma.

Lemma 9.3.1 (Blyth, 1951). If the risk function $R(\hat\theta, \theta)$ is continuous in $\theta$ for each $\hat\theta$, and if the prior distribution $H(\theta)$ has a positive p.d.f. at all $\theta$ then the Bayes estimator $\hat\theta_H(x)$ is admissible.

Proof. By negation, if $\hat\theta_H(x)$ is inadmissible then there exists another estimator $\hat\theta^*(x)$ for which (9.3.1) holds. Let $\theta^*$ be a point at which the strong inequality (ii) of (9.3.1) holds. Since $R(\hat\theta, \theta)$ is continuous in $\theta$ for each $\hat\theta$, there exists a neighborhood $N(\theta^*)$ around $\theta^*$ over which the inequality (ii) holds for all $\theta \in N(\theta^*)$. Since $h(\theta) > 0$ for all $\theta$, $P_H\{N(\theta^*)\} > 0$. Finally, from inequality (i) we obtain that

$$\int R(\hat\theta^*, \theta)h(\theta)\,d\theta = \int_{N(\theta^*)} R(\hat\theta^*, \theta)h(\theta)\,d\theta + \int_{\bar N(\theta^*)} R(\hat\theta^*, \theta)h(\theta)\,d\theta < \int_{N(\theta^*)} R(\hat\theta_H, \theta)h(\theta)\,d\theta + \int_{\bar N(\theta^*)} R(\hat\theta_H, \theta)h(\theta)\,d\theta. \quad (9.3.2)$$

The left-hand side of (9.3.2) is the prior risk of $\hat\theta^*$ and the right-hand side is the prior risk of $\hat\theta_H$. But this result contradicts the assumption that $\hat\theta_H$ is Bayes with respect to $H(\theta)$. QED

All the examples, given in Chapter 8, of proper Bayes estimators illustrate admissible estimators. Improper Bayes estimators are not necessarily admissible. For example, in the $N(\mu, \sigma^2)$ case, when both parameters are unknown, the formal Bayes estimator of $\sigma^2$ with respect to the Jeffreys improper prior $h(\sigma^2)\,d\sigma^2 \propto d\sigma^2/\sigma^2$ is $Q/(n - 3)$, where $Q = \sum(X_i - \bar X)^2$. This estimator is, however, inadmissible, since $Q/(n + 1)$ has a smaller MSE for all $\sigma^2$. There are also admissible estimators that are not Bayes. For example, the sample mean $\bar X$ from a normal distribution $N(\theta, 1)$ is an admissible estimator with respect to a squared-error loss. However, $\bar X$ is not a proper Bayes estimator. It is a limit (as $k \to \infty$) of the Bayes estimators derived in Section 8.4, $\hat\theta_k = \bar X\left(1 + \frac{1}{nk}\right)^{-1}$. $\bar X$ is also an improper Bayes estimator with respect to the Jeffreys improper prior $h(\theta)\,d\theta \propto d\theta$. Indeed, for such an improper prior

$$\frac{\int_{-\infty}^{\infty} \theta\exp\left\{-\frac{n}{2}(\bar X - \theta)^2\right\} d\theta}{\int_{-\infty}^{\infty} \exp\left\{-\frac{n}{2}(\bar X - \theta)^2\right\} d\theta} = \bar X. \quad (9.3.3)$$

The previous lemma cannot establish the admissibility of the sample mean $\bar X$. We provide here several lemmas that can be used.
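The comparison between $Q/(n - 3)$ and $Q/(n + 1)$ mentioned above can be verified with a few lines of arithmetic. The sketch below uses only the facts that $Q/\sigma^2 \sim \chi^2[n - 1]$ has mean $n - 1$ and variance $2(n - 1)$; the function name and the sample sizes are illustrative.

```python
def scaled_mse(n, c):
    """MSE of Q / c divided by sigma**4, when Q / sigma**2 ~ chi-square[n - 1]."""
    nu = n - 1
    variance = 2.0 * nu / c ** 2   # V{Q / c} / sigma**4
    bias = nu / c - 1.0            # (E{Q / c} - sigma**2) / sigma**2
    return variance + bias ** 2

for n in (5, 10, 30):
    print(n, scaled_mse(n, n - 3), scaled_mse(n, n - 1), scaled_mse(n, n + 1))
```

Because the ratio of the MSE to $\sigma^4$ does not depend on $\sigma^2$, the ordering seen in each row holds for all $\sigma^2$. The middle column, for the unbiased estimator $Q/(n - 1)$, is included only for comparison.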
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Lemma 9.3.2. Assume that the MSE of anestimator 6, attains the Cramér-Rao lower 
bound (under the proper regularity conditions) for all $\theta$, $-\infty < \theta < \infty$, which is

$$C_1(\theta) = B_1^2(\theta) + \frac{(1 + B_1'(\theta))^2}{I(\theta)}, \quad (9.3.4)$$

where $B_1(\theta)$ is the bias of $\hat\theta_1$. Moreover, if for any estimator $\hat\theta_2$ having a Cramér-Rao lower bound $C_2(\theta)$, the inequality $C_2(\theta) \leq C_1(\theta)$ for all $\theta$ implies that $B_2(\theta) = B_1(\theta)$ for all $\theta$, then $\hat\theta_1$ is admissible.

Proof. If $\hat\theta_1$ is inadmissible, there exists an estimator $\hat\theta_2$ such that

$$R(\hat\theta_2, \theta) \leq R(\hat\theta_1, \theta), \quad \text{for all } \theta,$$

with a strict inequality at some $\theta'$. Since $R(\hat\theta_1, \theta) = C_1(\theta)$ for all $\theta$, we have

$$C_2(\theta) \leq R(\hat\theta_2, \theta) \leq R(\hat\theta_1, \theta) = C_1(\theta) \quad (9.3.5)$$

for all $\theta$. But, according to the hypothesis, (9.3.5) implies that $B_1(\theta) = B_2(\theta)$ for all $\theta$. Hence, $C_1(\theta) = C_2(\theta)$ for all $\theta$. But this contradicts the assumption that $R(\hat\theta_2, \theta') < R(\hat\theta_1, \theta')$. Hence, $\hat\theta_1$ is admissible. QED

Lemma 9.3.2 can be applied to prove that, in the case of a sample from $N(0, \sigma^2)$, $S = \frac{1}{n + 2}\sum_{i=1}^n X_i^2$ is an admissible estimator of $\sigma^2$. (The MVUE and the MLE are inadmissible!) In such an application, we have to show that the hypotheses of Lemma 9.3.2 are satisfied. In the $N(0, \sigma^2)$ case, it requires lengthy and tedious computations (Zacks, 1971, p. 373). Lemma 9.3.2 is also useful to prove the following lemma (Girshick and Savage, 1951).


Lemma 9.3.3. Let $X$ be a one-parameter exponential type random variable, with p.d.f.

$$f(x; \psi) = h(x)\exp\{\psi x - K(\psi)\},$$

$-\infty < \psi < \infty$. Then $\hat\mu = X$ is an admissible estimator of its expectation $\mu(\psi) = K'(\psi)$, for the quadratic loss function $(\hat\mu - \mu)^2/\sigma^2(\psi)$; where $\sigma^2(\psi) = K''(\psi)$ is the variance of $X$.

Proof. The proof of the present lemma is based on the following points. First, $X$ is an unbiased estimator of $\mu(\psi)$. Since the distribution of $X$ is of the exponential type, its variance $\sigma^2(\psi)$ is equal to the Cramér-Rao lower bound, i.e.,

$$\sigma^2(\psi) = (\mu'(\psi))^2/I(\psi) = (\sigma^2(\psi))^2/I(\psi). \quad (9.3.6)$$

This implies that $I(\psi) = \sigma^2(\psi)$, which can be also derived directly. If $\tilde\mu(X)$ is any other estimator of $\mu(\psi)$ satisfying the Cramér-Rao regularity condition with variance $D^2(\psi)$, such that

$$D^2(\psi) \leq \sigma^2(\psi), \quad \text{all } -\infty < \psi < \infty, \quad (9.3.7)$$

then from the Cramér-Rao inequality

$$B^2(\psi) + \frac{(B'(\psi) + \mu'(\psi))^2}{\sigma^2(\psi)} \leq D^2(\psi) \leq \sigma^2(\psi), \quad \text{all } \psi, \quad (9.3.8)$$

where $B(\psi)$ is the bias function of $\tilde\mu(X)$. Thus, we arrive at the inequality

$$B^2(\psi)\sigma^2(\psi) + (B'(\psi) + \sigma^2(\psi))^2 \leq \sigma^4(\psi), \quad (9.3.9)$$

all $-\infty < \psi < \infty$. This implies that

$$B'(\psi) \leq 0 \quad \text{and} \quad B^2(\psi) + 2B'(\psi) \leq -\frac{(B'(\psi))^2}{\sigma^2(\psi)} \leq 0, \quad (9.3.10)$$

for all $-\infty < \psi < \infty$. From (9.3.10), we obtain that either $B(\psi) = 0$ for all $\psi$ or

$$\frac{d}{d\psi}\left\{\frac{1}{B(\psi)}\right\} \geq \frac{1}{2} \quad (9.3.11)$$

for all $\psi$ such that $B(\psi) \neq 0$. Since $B(\psi)$ is a decreasing function, either $B(\psi) = 0$ for all $\psi \geq \psi_0$ or $B(\psi) \neq 0$ for all $\psi \geq \psi_0$. Let $G(\psi)$ be a function defined so that $G(\psi_0) = 1/B(\psi_0)$ and $G'(\psi) = 1/2$ for all $\psi \geq \psi_0$; i.e.,

$$G(\psi) = \frac{1}{B(\psi_0)} + \frac{1}{2}(\psi - \psi_0), \quad \text{all } \psi \geq \psi_0. \quad (9.3.12)$$

Since $1/B(\psi)$ is an increasing function and $\frac{d}{d\psi}(1/B(\psi)) \geq 1/2$, it is always above $G(\psi)$ on $\psi \geq \psi_0$. It follows that

$$\lim_{\psi\to\infty}(1/B(\psi)) = \infty, \quad \text{or} \quad \lim_{\psi\to\infty} B(\psi) = 0. \quad (9.3.13)$$

In a similar manner, we can show that $\lim_{\psi\to-\infty}(1/B(\psi)) = \infty$ or $\lim_{\psi\to-\infty} B(\psi) = 0$. This implies that $B(\psi) = 0$ for all $\psi$. Finally, since the bias function of $\hat\mu(X) = X$ is also identically zero, we obtain from the previous lemma that $\hat\mu(X)$ is admissible. QED


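Karlin (1958) established sufficient conditions for the admissibility of a linear estimator

$$\hat\theta_{\lambda,\gamma} = \frac{1}{1 + \lambda}T + \frac{\gamma\lambda}{1 + \lambda} \quad (9.3.14)$$

of the expected value of $T$ in the one-parameter exponential family, with p.d.f.

$$f(x; \theta) = \beta(\theta)e^{\theta T(x)}, \quad \underline{\theta} < \theta < \bar\theta.$$

Theorem 9.3.1 (Karlin). Let $X$ have a one-parameter exponential distribution, with $\underline{\theta} < \theta < \bar\theta$. Sufficient conditions for the admissibility of (9.3.14) as an estimator of $E_\theta\{T(X)\}$, under squared-error loss, are

$$\lim_{\theta^*\to\bar\theta}\int_{\theta_0}^{\theta^*} \frac{e^{-\gamma\lambda\theta}}{[\beta(\theta)]^\lambda}\,d\theta = \infty$$

and

$$\lim_{\theta^*\to\underline{\theta}}\int_{\theta^*}^{\theta_0} \frac{e^{-\gamma\lambda\theta}}{[\beta(\theta)]^\lambda}\,d\theta = \infty,$$

where $\underline{\theta} < \theta_0 < \bar\theta$.

For a proof of this theorem, see Lehmann and Casella (1998, p. 331).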

Considerable amount of research was conducted on the question of the admissi- 
bility of formal or generalized Bayes estimators. Some of the important results will 
be discussed later. We address ourselves here to the question of the admissibility 
of equivariant estimators of the location parameter in the one-dimensional case. We 
have seen that the minimum risk equivariant estimator of a location parameter 0, 
when finite risk equivariant estimators exist, is the Pitman estimator 


6(X) = Xa — E{Xqy | X@ — Xw,---, X@ — Xw}- 


The question is whether this estimator is admissible. Let Y = (XQ) — X(1), .-» X(n) — 
X 1) denote the maximal invariant statistic and let f(x | y) the conditional distribution 
of X(1), when 6 = 0, given Y = y. Stein (1959) proved the following. 
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Theorem 9.3.2. If §(X) is the Pitman estimator and 
E(LE(IXqy — E{Xay | YP | Y}°7} < 00, (9.3.15) 
then 0(X) is an admissible estimator of 0 with respect to the squared-error loss. 


We omit the proof of this theorem, which can be found in Stein's paper (1959) or in Zacks (1971, pp. 388-393). The admissibility of the Pitman estimator of a two-dimensional location parameter was proven later by James and Stein (1960). The Pitman estimator is not admissible, however, if the location parameter is a vector of order $p \ge 3$. This result, first established by Stein (1956) and by James and Stein (1960), will be discussed in the next section.

The Pitman estimator is a formal Bayes estimator. It is admissible in the real 
parameter case. The question is under what conditions formal Bayes estimators 
in general are admissible. Zidek (1970) established sufficient conditions for the 
admissibility of formal Bayes estimators having a bounded risk. 


9.3.2 The Inadmissibility of Some Commonly Used Estimators 


In this section, we discuss a few well-known examples of MLEs or best equivariant estimators that are inadmissible. The first example, developed by Stein (1956) and James and Stein (1960), established the inadmissibility of the MLE of the normal mean vector $\theta$, in the $N(\theta, I)$ model, when the dimension of $\theta$ is $p \ge 3$. The loss function considered is the squared-error loss, $L(\hat\theta, \theta) = \|\hat\theta - \theta\|^2$. This example opened a whole area of research and led to the development of a new type of estimator of a location vector, called the Stein estimators. Another example that will be presented establishes the inadmissibility of the best equivariant estimator of the variance of a normal distribution when the mean is unknown. This result is also due to Stein (1964). Other related results will be mentioned too.


I. The Inadmissibility of the MLE in the $N(\theta, I)$ Case, With $p \ge 3$

Let $X$ be a random vector of $p$ components, with $p \ge 3$. Furthermore assume that $X \sim N(\theta, I)$. The assumption that the covariance matrix of $X$ is $I$ is not a restrictive one, since if $X \sim N(\theta, V)$, with a known $V$, we can consider $Y = C^{-1}X$, where $V = CC'$. Obviously, $Y \sim N(\eta, I)$ where $\eta = C^{-1}\theta$. Without loss of generality, we also assume that the sample size is $n = 1$. The MLE of $\theta$ is $X$ itself. Consider the squared-error loss function $L(\hat\theta, \theta) = \|\hat\theta - \theta\|^2$. Since $X$ is unbiased, the risk of the MLE is $R^* = p$ for all $\theta$. We now exhibit an estimator whose risk function is smaller than $p$ for all $\theta$, and whose risk is close to 2 when $\theta$ is close to zero. The estimator suggested by Stein is

$$\hat\theta = \left(1 - \frac{p-2}{X'X}\right)X. \tag{9.3.16}$$
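This estimator is called the James-Stein estimator. The risk function of (9.3.16) is

$$\begin{aligned} R(\hat\theta, \theta) &= E_\theta\left\{\left\|X - \theta - \frac{p-2}{X'X}X\right\|^2\right\} \\ &= E_\theta\{\|X - \theta\|^2\} - 2(p-2)E_\theta\left\{\frac{X'(X-\theta)}{X'X}\right\} + (p-2)^2 E_\theta\left\{\frac{1}{X'X}\right\}. \end{aligned} \tag{9.3.17}$$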

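The first term on the RHS of (9.3.17) is $p$. We notice that $X'X \sim \chi^2[p; \frac{1}{2}\theta'\theta]$. Accordingly,

$$E_\theta\left\{\frac{1}{X'X}\right\} = E_\theta\{E\{(\chi^2[p+2J])^{-1} \mid J\}\} = E_\theta\left\{\frac{1}{p-2+2J}\right\}, \tag{9.3.18}$$

where $J \sim P(\theta'\theta/2)$. We turn now to the second term on the RHS of (9.3.17). Let $U = X'\theta/\|\theta\|$ and $V = \left\|X - U\theta/\|\theta\|\right\|^2$. Note that $U \sim N(\|\theta\|, 1)$ is independent of $V$ and $V \sim \chi^2[p-1]$. Indeed, we can write

$$V = X'\left(I - \frac{\theta\theta'}{\|\theta\|^2}\right)X, \tag{9.3.19}$$

where $A = (I - \theta\theta'/\|\theta\|^2)$ is an idempotent matrix of rank $p - 1$. Hence, $V \sim \chi^2[p-1]$. Moreover, $A\theta/\|\theta\| = 0$. Hence, $U$ and $V$ are independent. Furthermore,

$$\|X\|^2 = \left\|X - U\frac{\theta}{\|\theta\|} + U\frac{\theta}{\|\theta\|}\right\|^2 = U^2 + V + 2U X'A\theta/\|\theta\| = U^2 + V. \tag{9.3.20}$$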


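We let $W = \|X\|^2$, and derive the p.d.f. of $U/W$. This is needed, since the second term on the RHS of (9.3.17) is $-2(p-2)[1 - \|\theta\| E_\theta\{U/W\}]$. Since $U$ and $V$ are independent, their joint p.d.f. is

$$f(u, v \mid \theta) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(u - \|\theta\|)^2\right\} \cdot \frac{1}{2^{\frac{p-1}{2}}\Gamma\left(\frac{p-1}{2}\right)}\, v^{\frac{p-3}{2}}\exp\left\{-\frac{1}{2}v\right\}. \tag{9.3.21}$$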
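Thus, the joint p.d.f. of $U$ and $W$ is

$$g(u, w; \theta) = \frac{(w - u^2)^{\frac{p-3}{2}}}{\sqrt{2\pi}\, 2^{\frac{p-1}{2}}\Gamma\left(\frac{p-1}{2}\right)} \exp\left\{-\frac{1}{2}\|\theta\|^2 - \frac{1}{2}w + \|\theta\|u\right\}, \tag{9.3.22}$$

$0 \le u^2 \le w < \infty$. The p.d.f. of $R = U/W$ is then

$$h(r \mid \theta) = \int_0^\infty w\, g(rw, w; \theta)\, dw. \tag{9.3.23}$$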


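Accordingly,

$$E_\theta\left\{\frac{U}{W}\right\} = \frac{e^{-\frac{1}{2}\|\theta\|^2}}{\sqrt{2\pi}\, 2^{\frac{p-1}{2}}\Gamma\left(\frac{p-1}{2}\right)} \int_0^\infty dw \int_{-\sqrt{w}}^{\sqrt{w}} \frac{u}{w}(w - u^2)^{\frac{p-3}{2}} \exp\left\{\|\theta\|u - \frac{1}{2}w\right\} du. \tag{9.3.24}$$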


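By making the change of variables to $t = u/\sqrt{w}$ and expanding $\exp\{\|\theta\| t\sqrt{w}\}$ we obtain, after some manipulations,

$$\begin{aligned} E_\theta\left\{\frac{(X-\theta)'X}{X'X}\right\} &= 1 - \|\theta\| E_\theta\left\{\frac{U}{W}\right\} \\ &= 1 - \|\theta\|^2 \exp\left\{-\frac{1}{2}\|\theta\|^2\right\} \sum_{j=0}^\infty \frac{(\frac{1}{2}\|\theta\|^2)^j}{j!\,(p+2j)} \\ &= \exp\left\{-\frac{1}{2}\|\theta\|^2\right\} \sum_{j=0}^\infty \frac{(\frac{1}{2}\|\theta\|^2)^j}{j!} \cdot \frac{p-2}{p-2+2j} \\ &= E_\theta\left\{\frac{p-2}{p-2+2J}\right\}, \end{aligned} \tag{9.3.25}$$

where $J \sim P(\frac{1}{2}\theta'\theta)$. From (9.3.17), (9.3.18) and (9.3.25), we obtain

$$R(\hat\theta, \theta) = p - E_\theta\left\{\frac{(p-2)^2}{p-2+2J}\right\} < p, \quad \text{for all } \theta. \tag{9.3.26}$$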


Note that when $\theta = 0$, $P_0[J = 0] = 1$ and $R(\hat\theta, 0) = 2$. On the other hand, $\lim_{\|\theta\|\to\infty} R(\hat\theta, \theta) = p$. The estimator $\hat\theta$ given by (9.3.16) has smaller risk than the MLE for all $\theta$ values. In the above development, there is nothing to tell us whether (9.3.16) is itself admissible. Note that (9.3.16) is not an equivariant estimator with respect to the group of real affine transformations, but it is equivariant with respect to the group of orthogonal transformations (rotations). If the vector $X$ has a known covariance matrix $V$, the estimator (9.3.16) should be modified to

$$\hat\theta(X) = \left(1 - \frac{p-2}{X'V^{-1}X}\right)X. \tag{9.3.27}$$


This estimator is equivariant with respect to the group $G$ of nonsingular transformations $X \to AX$. Indeed, the covariance matrix of $Y = AX$ is $\Sigma = AVA'$. Therefore, $Y'\Sigma^{-1}Y = X'V^{-1}X$ for every $A \in G$.
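The risk formula (9.3.26) can be evaluated numerically by summing the Poisson series for $E_\theta\{1/(p-2+2J)\}$. The following is a minimal sketch under stated assumptions: the function name `james_stein_risk` and the truncation rule for the series are illustrative choices, not part of the text.

```python
import math

def james_stein_risk(p, theta_norm_sq):
    """Risk of the James-Stein estimator (9.3.16) via (9.3.26).

    R = p - (p-2)^2 * E{1/(p-2+2J)}, with J ~ Poisson(theta_norm_sq / 2).
    The Poisson series is truncated far in the upper tail (an assumption
    of this sketch; the neglected mass is negligible for moderate inputs).
    """
    if p < 3:
        raise ValueError("the formula requires p >= 3")
    lam = theta_norm_sq / 2.0
    if lam == 0.0:
        expectation = 1.0 / (p - 2)  # J = 0 with probability one
    else:
        upper = int(lam + 12.0 * math.sqrt(lam) + 50)
        expectation = 0.0
        for j in range(upper + 1):
            # Poisson weight computed on the log scale to avoid overflow
            log_weight = -lam + j * math.log(lam) - math.lgamma(j + 1)
            expectation += math.exp(log_weight) / (p - 2 + 2 * j)
    return p - (p - 2) ** 2 * expectation

if __name__ == "__main__":
    for norm_sq in (0.0, 1.0, 10.0, 100.0):
        print(norm_sq, james_stein_risk(5, norm_sq))
```

At $\theta = 0$ the function returns exactly 2, and the values increase towards $p$ as $\|\theta\|^2$ grows, in agreement with the limits noted after (9.3.26).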

Baranchick (1973) showed, in a manner similar to the above, that in the usual multiple regression model with normal distributions the commonly used MLEs of the regression coefficients are inadmissible. More specifically, let $X_1, \ldots, X_n$ be a sample of $n$ i.i.d. $(p+1)$-dimensional random vectors, having a multinormal distribution $N(\theta, \Sigma)$. Consider the regression of $Y = X_1$ on $Z = (X_2, \ldots, X_{p+1})'$. If we consider the partition $\theta' = (\eta, \zeta')$ and

$$\Sigma = \begin{pmatrix} \tau^2 & C' \\ C & V \end{pmatrix},$$

then the regression of $Y$ on $Z$ is given by

$$E\{Y \mid Z\} = \alpha + \beta'Z,$$

where $\alpha = \eta - \beta'\zeta$ and $\beta = V^{-1}C$. The problem is to estimate the vector of regression coefficients $\beta$. The least-squares estimator (LSE) is $\hat\beta = S^{-1}\left(\sum_{i=1}^n Y_iZ_i - n\bar{Y}\bar{Z}\right)$, where $S = \sum_{i=1}^n Z_iZ_i' - n\bar{Z}\bar{Z}'$. $Y_1, \ldots, Y_n$ and $Z_1, \ldots, Z_n$ are the sample statistics corresponding to $X_1, \ldots, X_n$.

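Consider the loss function

$$L(\hat\alpha, \hat\beta; \alpha, \beta, \Sigma) = \frac{[(\hat\alpha - \alpha) + (\hat\beta - \beta)'\zeta]^2 + (\hat\beta - \beta)'V(\hat\beta - \beta)}{\tau^2 - C'V^{-1}C}. \tag{9.3.28}$$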


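With respect to this loss function Baranchick proved that the estimators

$$\hat\beta_c = \left(1 - c\,\frac{1-R^2}{R^2}\right)\hat\beta, \qquad \hat\alpha_c = \bar{Y} - \hat\beta_c'\bar{Z}, \tag{9.3.29}$$

have risk functions smaller than those of the LSEs (MLEs) $\hat\beta$ and $\hat\alpha = \bar{Y} - \hat\beta'\bar{Z}$, at all the parameter values, provided $c \in \left(0, \frac{2(p-2)}{n-p+2}\right)$ and $p \ge 3$, $n \ge p + 2$. $R^2$ is the squared multiple correlation coefficient given by

$$R^2 = \left(\sum_{i=1}^n Y_iZ_i - n\bar{Y}\bar{Z}\right)' S^{-1} \left(\sum_{i=1}^n Y_iZ_i - n\bar{Y}\bar{Z}\right) \Big/ \left(\sum_{i=1}^n Y_i^2 - n\bar{Y}^2\right).$$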


The proof is very technical and is omitted. The above results of Stein and Baranchick on the inadmissibility of the MLEs can be obtained from the following theorem of Cohen (1966), which characterizes all the admissible linear estimators of the mean vector of multinormal distributions. The theorem provides only the conditions for the admissibility of the estimators and, contrary to the results of Stein and Baranchick, it does not construct alternative estimators.


Theorem 9.3.3 (Cohen, 1966). Let $X \sim N(\theta, I)$ where the dimension of $X$ is $p$. Let $\hat\theta = AX$ be an estimator of $\theta$, where $A$ is a $p \times p$ matrix of known coefficients. Then $\hat\theta$ is admissible with respect to the squared-error loss $\|\hat\theta - \theta\|^2$ if and only if $A$ is symmetric and its eigenvalues $\alpha_i$ $(i = 1, \ldots, p)$ satisfy the inequality

$$0 \le \alpha_i \le 1, \quad \text{for all } i = 1, \ldots, p, \tag{9.3.30}$$

with equality to 1 for at most two of the eigenvalues.


For a proof of the theorem, see Cohen (1966) or Zacks (1971, pp. 406-408). Note that for the MLE of $\theta$ the matrix $A$ is $I$, and all the eigenvalues are equal to 1. Thus, if $p \ge 3$, $X$ is an inadmissible estimator. If we shrink the MLE towards the origin and consider the estimator $\hat\theta_\lambda = \lambda X$ with $0 < \lambda < 1$ then the resulting estimator is admissible for any dimension $p$. Indeed, $\hat\theta_\lambda$ is actually the Bayes estimator (8.4.31) with $A_1 = V = I$, $\Sigma = \tau^2 I$ and $A_2 = 0$. In this case, the Bayes estimator is $\hat\beta_\tau = \frac{\tau^2}{1+\tau^2}X$, where $0 < \tau < \infty$. We set $\lambda = \tau^2/(1+\tau^2)$. According to Lemma 9.3.1, this proper Bayes estimator is admissible. In Section 9.3.3, we will discuss more meaningful adjustments of the MLE to obtain admissible estimators of $\theta$.


II. The Inadmissibility of the Best Equivariant Estimators of the Scale Parameter
When the Location Parameter is Unknown
Consider first the problem of estimating the variance of a normal distribution $N(\mu, \sigma^2)$ when the mean $\mu$ is unknown. Let $X_1, \ldots, X_n$ be i.i.d. random variables having such a distribution. Let $(\bar{X}, Q)$ be the minimal sufficient statistic, $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ and $Q = \sum_{i=1}^n (X_i - \bar{X})^2$. We have seen that the minimum risk equivariant estimator, with respect to the quadratic loss $L(\hat\sigma^2, \sigma^2) = (\hat\sigma^2 - \sigma^2)^2$, is $\hat\sigma_0^2 = Q/(n+1)$. Stein (1964) showed that this estimator is, however, inadmissible! The estimator

$$\hat\sigma_1^2 = \min\left\{\frac{Q + n\bar{X}^2}{n+2}, \frac{Q}{n+1}\right\} \tag{9.3.31}$$


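has uniformly smaller risk function. We present here Stein's proof of this inadmissibility.

Let $S = Q + n\bar{X}^2$. Obviously, $S \sim \sigma^2\chi^2[n; n\mu^2/2\sigma^2] \sim \sigma^2\chi^2[n+2J]$ where $J \sim P(n\mu^2/2\sigma^2)$. Consider the scale equivariant estimators that are functions of $(Q, S)$. Their structure is $f(Q, S) = S\phi(Q/S)$. Moreover, the conditional distribution of $Q/S$ given $J$ is the beta distribution $\beta\left(\frac{n-1}{2}, \frac{1}{2} + J\right)$. Furthermore, given $J$, $Q/S$ and $S$ are conditionally independent. Note that for $\hat\sigma_0^2$ we use the function $\phi_0(Q/S) = \frac{Q}{S(n+1)}$. Consider the estimator

$$\hat\sigma_1^2 = \min\left\{\frac{Q}{n+1}, \frac{S}{n+2}\right\} = S\min\left\{\frac{1}{n+1}\frac{Q}{S}, \frac{1}{n+2}\right\}. \tag{9.3.32}$$

Here, $\phi_1(Q/S) = \min\left\{\frac{1}{n+1}\frac{Q}{S}, \frac{1}{n+2}\right\}$. The risk function, for the quadratic loss $L(\hat\sigma^2, \sigma^2) = (\hat\sigma^2 - \sigma^2)^2/\sigma^4$ is, for any function $\phi(Q/S)$,

$$R(\phi) = E\left\{\left[\frac{S}{\sigma^2}\phi\left(\frac{Q}{S}\right) - 1\right]^2\right\}, \tag{9.3.33}$$

where $Q \sim \sigma^2\chi_1^2[n-1]$ and $S \sim \sigma^2\chi_2^2[n+2J]$. Let $W = Q/S$. Then,

$$\begin{aligned} R(\phi) &= E\{E\{[\chi_2^2[n+2J]\phi(W) - 1]^2 \mid J, W\}\} \\ &= E\{\phi^2(W)(n+2J)(n+2J+2) - 2\phi(W)(n+2J) + 1\} \\ &= E\left\{(n+2J)(n+2J+2)\left[\phi(W) - \frac{1}{n+2J+2}\right]^2 + \frac{2}{n+2J+2}\right\}. \end{aligned} \tag{9.3.34}$$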
We can also write,

$$\left(\phi_0(W) - \frac{1}{n+2J+2}\right)^2 = (\phi_0(W) - \phi_1(W))^2 + \left(\phi_1(W) - \frac{1}{n+2J+2}\right)^2 + 2(\phi_0(W) - \phi_1(W))\left(\phi_1(W) - \frac{1}{n+2J+2}\right). \tag{9.3.35}$$

We notice that $\phi_1(W) \le \phi_0(W)$. Moreover, if $\frac{W}{n+1} \le \frac{1}{n+2}$, then $\phi_1(W) = \phi_0(W)$, and the first and third terms on the RHS of (9.3.35) are zero. Otherwise,

$$\phi_0(W) > \phi_1(W) = \frac{1}{n+2} \ge \frac{1}{n+2+2J}, \quad J = 0, 1, \ldots.$$

Hence,

$$\left(\phi_0(W) - \frac{1}{n+2J+2}\right)^2 \ge \left(\phi_1(W) - \frac{1}{n+2J+2}\right)^2 \tag{9.3.36}$$

for all $J$ and $W$, with strict inequality on a $(J, W)$ set having positive probability. From (9.3.34) and (9.3.36) we obtain that $R(\phi_1) < R(\phi_0)$. This proves that $\hat\sigma_0^2$ is inadmissible.
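The comparison between the best equivariant estimator $Q/(n+1)$ and Stein's estimator (9.3.31) can be illustrated by simulation. The following is a minimal sketch, not Stein's argument: the function names, the seed, and the Monte Carlo design are assumptions made for illustration, and the simulated risks are subject to sampling error.

```python
import random

def stein_variance_estimates(sample):
    """Return (Q/(n+1), min{S/(n+2), Q/(n+1)}) for one sample, S = Q + n*mean^2."""
    n = len(sample)
    mean = sum(sample) / n
    q = sum((x - mean) ** 2 for x in sample)
    s = q + n * mean * mean
    best_equivariant = q / (n + 1)
    stein = min(s / (n + 2), best_equivariant)
    return best_equivariant, stein

def simulated_risks(n, mu, sigma, reps, seed=0):
    """Monte Carlo risks under the loss (estimate - sigma^2)^2 / sigma^4."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    loss0 = 0.0
    loss1 = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        e0, e1 = stein_variance_estimates(sample)
        loss0 += (e0 / sigma ** 2 - 1.0) ** 2
        loss1 += (e1 / sigma ** 2 - 1.0) ** 2
    return loss0 / reps, loss1 / reps

if __name__ == "__main__":
    # The gain of Stein's estimator is largest when mu is near zero.
    print(simulated_risks(n=10, mu=0.0, sigma=1.0, reps=20000))
```

The first simulated risk should be close to the constant value $2/(n+1)$, and Stein's estimate never exceeds $Q/(n+1)$ on any sample, since it is a minimum of two quantities one of which is $Q/(n+1)$.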

The above method of Stein can be used to prove the inadmissibility of equivariant 
estimators of the variance parameters also in other normal models. See for example, 
Klotz, Milton and Zacks (1969) for a proof of the inadmissibility of equivariant 
estimators of the variance components in Model II of ANOVA. 

Brown (1968) studied the question of the admissibility of the minimum risk (best) equivariant estimators of the scale parameter $\sigma$ in the general location and scale parameter model, with p.d.f.s $f(x; \mu, \sigma) = \frac{1}{\sigma}\phi\left(\frac{x-\mu}{\sigma}\right)$; $-\infty < \mu < \infty$, $0 < \sigma < \infty$. The loss functions considered are invariant bowl-shaped functions, $L(\delta)$. These are functions that are nonincreasing for $\delta \le \delta_0$ and nondecreasing for $\delta > \delta_0$ for some $\delta_0$. Given the order statistic $X_{(1)} \le \cdots \le X_{(n)}$ of the sample, let $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_{(i)}$, $S = \left[\frac{1}{n}\sum_{i=1}^n (X_{(i)} - \bar{X})^2\right]^{1/2}$ and $Z_i = (X_{(i)} - \bar{X})/S$, $i = 3, \ldots, n$. $Z = (Z_3, \ldots, Z_n)$ is a maximal invariant with respect to the group $G$ of real affine transformations. The best equivariant estimator of $\omega = \sigma^k$ is of the form $\hat\omega_0 = \phi_0(Z)S^k$, where $\phi_0(Z)$ is an optimally chosen function. Brown proved that the estimator

$$\hat\omega_1(S, Z) = \begin{cases} \phi_1(Z)S^k, & \text{if } \left|\frac{\bar{X}}{S}\right| < K(Z), \\ \phi_0(Z)S^k, & \text{if } \left|\frac{\bar{X}}{S}\right| \ge K(Z), \end{cases} \tag{9.3.37}$$

where $K(Z)$ is an appropriately chosen function and $\phi_1(Z) < \phi_0(Z)$, has uniformly smaller risk than $\hat\omega_0$. This established the inadmissibility of the best equivariant estimator, when the location parameter is unknown, for general families of distributions and loss functions. Arnold (1970) provided a similar result in the special case of the family of shifted exponential distributions, i.e., $f(x; \mu, \sigma) = I\{x \ge \mu\}\frac{1}{\sigma}\exp\left\{-\frac{x-\mu}{\sigma}\right\}$, $-\infty < \mu < \infty$, $0 < \sigma < \infty$.

Brewster and Zidek (1974) showed that in certain cases one can refine Brown's approach by constructing a sequence of improving estimators converging to a generalized Bayes estimator. The risk function of this estimator does not exceed that of the best equivariant estimators. In the normal case $N(\mu, \sigma^2)$, this estimator is of the form $\phi^*(Z)Q$, where

$$\phi^*(z) = E\{Q \mid Z \le z\}/E\{Q^2 \mid Z \le z\}, \tag{9.3.38}$$

with $Q = \sum_{i=1}^n (X_i - \bar{X})^2$, $Z = \sqrt{n}|\bar{X}|/\sqrt{Q}$. The conditional expectations in (9.3.38) are computed with $\mu = 0$ and $\sigma = 1$. Brewster and Zidek (1974) provided a general group theoretic framework for deriving such estimators in the general case.


9.3.3. Minimax and Admissible Estimators of the Location Parameter 


In Section 9.3.2, we presented the James-Stein proof that the MLE of the location parameter vector in the $N(\theta, I)$ case with dimension $p \ge 3$ is inadmissible. It was shown that the estimator (9.3.16) is uniformly better than the MLE. The estimator (9.3.16) is, however, also inadmissible. Several studies have been published on the question of adjusting estimator (9.3.16) to obtain minimax estimators. In particular, see Berger and Bock (1976). Baranchick (1970) showed that a family of minimax estimators of $\theta$ is given by

$$\hat\theta_\phi = \left(1 - \frac{(p-2)\phi(S)}{S}\right)X, \tag{9.3.39}$$
where $S = X'X$ and $\phi(S)$ is a function satisfying the conditions:

$$\text{(i) } 0 \le \phi(S) \le 2; \qquad \text{(ii) } \phi(S) \text{ is nondecreasing in } S. \tag{9.3.40}$$

If the model is $N(\theta, \sigma^2 I)$ with known $\sigma^2$ then the above result holds with $S = X'X/\sigma^2$. If $\sigma^2$ is unknown and $\hat\sigma^2$ is an estimator of $\sigma^2$ having a distribution like $\sigma^2\chi^2[\nu]/(\nu+2)$ then we substitute $S = X'X/\hat\sigma^2$ in (9.3.39). The minimaxity of (9.3.39) is established by proving that its risk function, for the squared-error loss, does not exceed the constant risk, $R^* = p$, of the MLE $X$. Note that the MLE, $X$, is also minimax. In addition, (9.3.39) can be improved by


$$\hat\theta_\phi^+ = \left(1 - \frac{(p-2)\phi(S)}{S}\right)^+ X, \tag{9.3.41}$$

where $a^+ = \max(a, 0)$. These estimators are not necessarily admissible. Admissible and minimax estimators of $\theta$ similar to (9.3.39) were derived by Strawderman (1972) for cases of known $\sigma^2$ and $p \ge 5$. These estimators are

$$\hat\theta_a(X) = \left(1 - \frac{p-2a+2}{S}\cdot\phi(S)\right)X, \tag{9.3.42}$$

where $\frac{1}{2} \le a < 1$ for $p = 5$ and $0 \le a < 1$ for $p \ge 6$, also

$$\phi(S) = \frac{G\left(\frac{S}{2}; \frac{p}{2} - a + 2\right)}{G\left(\frac{S}{2}; \frac{p}{2} - a + 1\right)},$$

in which $G(x; \nu) = P\{G(1, \nu) \le x\}$. In Example 9.9, we show that $\hat\theta_a(X)$ are generalized Bayes estimators for the squared-error loss and the hyper-prior model

(i) $\theta \mid \lambda \sim N\left(0, \frac{1-\lambda}{\lambda}I\right)$, $0 < \lambda \le 1$;
(ii) $\lambda$ has a generalized prior measure $G(\lambda)$ so that $dG(\lambda) \propto \lambda^{-a}d\lambda$, $-\infty < a < \frac{p}{2} + 1$. (9.3.43)

Note that if $a < 1$ then $G(\lambda)$ is a proper prior and $\hat\theta_a(X)$ is a proper Bayes estimator. For a proof of the admissibility for the more general case of $a < \frac{p}{2} + 1$, see Lin (1974).
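The family (9.3.39) and its positive-part version (9.3.41) are simple to compute once the function $\phi(S)$ is chosen. The following is a minimal sketch for the $N(\theta, I)$ case; the function name `baranchik_estimate` and the default choice $\phi \equiv 1$ (which gives the James-Stein estimator) are assumptions made for illustration.

```python
def baranchik_estimate(x, phi=lambda s: 1.0, positive_part=True):
    """Shrinkage estimate (1 - (p-2) phi(S)/S) x with S = x'x.

    phi should satisfy 0 <= phi(S) <= 2 and be nondecreasing in S,
    as required by (9.3.40); this sketch does not verify that.
    With positive_part=True the factor is truncated at zero, as in (9.3.41).
    """
    p = len(x)
    if p < 3:
        raise ValueError("shrinkage of this form requires p >= 3")
    s = sum(v * v for v in x)
    if s == 0.0:
        raise ValueError("S = x'x must be positive")
    factor = 1.0 - (p - 2) * phi(s) / s
    if positive_part:
        factor = max(factor, 0.0)
    return [factor * v for v in x]

if __name__ == "__main__":
    print(baranchik_estimate([2.0, 2.0, 2.0, 2.0]))  # mild shrinkage
    print(baranchik_estimate([0.5, 0.5, 0.5, 0.5]))  # factor truncated at zero
```

When $S$ is small the untruncated factor becomes negative and reverses the sign of every coordinate; the positive-part rule replaces it by zero, which is the improvement described by (9.3.41).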
9.3.4 The Relationship of Empirical Bayes and Stein-Type Estimators of the
Location Parameter in the Normal Case


Efron and Morris (1972a, 1972b, 1973) show the connection between the estimator (9.3.16) and empirical Bayes estimation of the mean vector of a multinormal distribution. Ghosh (1992) presents a comprehensive comparison of empirical Bayes and Stein-type estimators, in the case where $X \sim N(\theta, I)$. See also the studies of Lindley and Smith (1972), Casella (1985), and Morris (1983).

Recall that in parametric empirical Bayes procedures, one estimates the unknown prior parameters from a model of the predictive distribution of $X$, and substitutes the estimators in the formulae of the Bayesian estimators. On the other hand, in hierarchical Bayesian procedures one assigns specific hyper-prior distributions to the unknown parameters of the prior distributions. The two approaches may sometimes result in similar estimators.

Starting with the simple model of $p$-variate normal $X \mid \theta \sim N(\theta, I)$, and $\theta \sim N(0, \tau^2 I)$, the Bayes estimator of $\theta$, for the squared-error loss $L(\hat\theta, \theta) = \|\hat\theta - \theta\|^2$, is $\hat\theta_B = (1 - B)X$, where $B = 1/(1 + \tau^2)$. The predictive distribution of $X$ is $N(0, B^{-1}I)$. Thus, the predictive distribution of $X'X$ is like that of $B^{-1}\chi^2[p]$. Thus, for $p \ge 3$, $(p-2)/X'X$ is a predictive-unbiased estimator of $B$. Substituting this estimator for $B$ in $\hat\theta_B$ yields the parametric empirical Bayes estimator

$$\hat\theta_{EB}^{(1)} = \left(1 - \frac{p-2}{X'X}\right)X. \tag{9.3.44}$$


$\hat\theta_{EB}^{(1)}$ derived here is identical with the James-Stein estimator (9.3.16). If we change the Bayesian model so that $\theta \sim N(\mu 1, \tau^2 I)$, with $\mu$ known, then the Bayesian estimator is $\hat\theta_B = (1 - B)X + B\mu 1$, and the corresponding empirical Bayes estimator is

$$\hat\theta_{EB}^{(2)} = \left(1 - \frac{p-2}{(X - \mu 1)'(X - \mu 1)}\right)X + \frac{p-2}{(X - \mu 1)'(X - \mu 1)}\mu 1 = X - \frac{p-2}{\|X - \mu 1\|^2}(X - \mu 1). \tag{9.3.45}$$

If both $\mu$ and $\tau$ are unknown, the resulting empirical Bayes estimator is

$$\hat\theta_{EB}^{(3)} = X - \frac{p-3}{\sum_{i=1}^p (X_i - \bar{X})^2}(X - \bar{X}1). \tag{9.3.46}$$
i=1 
Ghosh (1992) showed that the predictive risk function of $\hat\theta_{EB}^{(3)}$, namely, the trace of the MSE matrix $E\{(\hat\theta_{EB}^{(3)} - \theta)(\hat\theta_{EB}^{(3)} - \theta)'\}$, is

$$R(\hat\theta_{EB}^{(3)}) = p - \frac{p-3}{1+\tau^2}. \tag{9.3.47}$$

Thus, $\hat\theta_{EB}^{(3)}$ has smaller predictive risk than the MLE $\hat\theta_{ML} = X$ if $p \ge 4$.
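The three empirical Bayes estimators (9.3.44)-(9.3.46) differ only in the point towards which $X$ is shrunk and in the multiplier of the shrinkage term. The following is a minimal sketch with hypothetical function names; for the case of unknown $\mu$ it uses the multiplier $p - 3$, which is the predictive-unbiased choice when $\mu$ is estimated by $\bar{X}$.

```python
def eb_zero_prior_mean(x):
    """Estimator (9.3.44): shrink x towards the origin."""
    p = len(x)
    shrink = (p - 2) / sum(v * v for v in x)
    return [(1.0 - shrink) * v for v in x]

def eb_known_prior_mean(x, mu):
    """Estimator (9.3.45): shrink x towards the known common prior mean mu."""
    p = len(x)
    shrink = (p - 2) / sum((v - mu) ** 2 for v in x)
    return [v - shrink * (v - mu) for v in x]

def eb_unknown_prior_mean(x):
    """Estimator (9.3.46): shrink x towards its own average (assumes p >= 4)."""
    p = len(x)
    xbar = sum(x) / p
    shrink = (p - 3) / sum((v - xbar) ** 2 for v in x)
    return [v - shrink * (v - xbar) for v in x]

if __name__ == "__main__":
    x = [2.0, 0.0, -1.0, 3.0, 1.0]
    print(eb_zero_prior_mean(x))
    print(eb_known_prior_mean(x, 0.0))
    print(eb_unknown_prior_mean(x))
```

With $\mu = 0$ the second function reproduces the first, which mirrors the remark that (9.3.44) coincides with the James-Stein estimator.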
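The Stein-type estimator of the parametric vector $\beta$ in the linear model $X = A\beta + \epsilon$ is

$$\hat\beta^* = (1 - \hat{B})\hat\beta, \tag{9.3.48}$$

where $\hat\beta$ is the LSE and

$$\hat{B} = \min(1, (p-2)\sigma^2/\hat\beta'A'A\hat\beta). \tag{9.3.49}$$

It is interesting to compare this estimator of $\beta$ with the ridge regression estimator (5.4.3), in which we substitute for the optimal $k$ value the estimator $p\sigma^2/\hat\beta'\hat\beta$. The ridge regression estimator then takes the form

$$\hat\beta^{**} = \left(I + \frac{\sigma^2 p}{\hat\beta'\hat\beta}(A'A)^{-1}\right)^{-1}\hat\beta. \tag{9.3.50}$$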


There is some analogy but the estimators are obviously different. A comprehensive 
study of the property of the Stein-type estimators for various linear models is presented 
in the book of Judge and Bock (1978). 
Example 9.1. Let $X$ be a binomial $B(n, \theta)$ random variable. $n$ is known, $0 < \theta < 1$. If we let $\theta$ have a prior beta distribution, i.e., $\theta \sim \beta(\nu_1, \nu_2)$, then the posterior distribution of $\theta$ given $X$ is the beta distribution $\beta(\nu_1 + X, \nu_2 + n - X)$. Consider the linear estimator $\hat\theta_{\alpha,\beta} = \frac{\alpha}{n}X + \beta$. The MSE of $\hat\theta_{\alpha,\beta}$ is

$$R(\hat\theta_{\alpha,\beta}, \theta) = \beta^2 + \frac{\theta}{n}[1 - 2(1-\alpha) + (1-\alpha)^2 - 2n\beta(1-\alpha)] - \frac{\theta^2}{n}[1 - 2(1-\alpha) + (1-\alpha)^2(1-n)].$$

We can choose $\alpha^0$ and $\beta^0$ so that $R(\hat\theta_{\alpha^0,\beta^0}, \theta) = (\beta^0)^2$. For this purpose, we set the equations

$$1 - 2(1-\alpha) + (1-\alpha)^2 - 2n\beta(1-\alpha) = 0,$$
$$1 - 2(1-\alpha) + (1-\alpha)^2(1-n) = 0.$$

The two roots are

$$\alpha^0 = \sqrt{n}/(1 + \sqrt{n}), \qquad \beta^0 = \frac{1}{2}/(1 + \sqrt{n}).$$

With these constants, we obtain the estimator

$$\theta^* = \frac{1}{\sqrt{n}(1+\sqrt{n})}X + \frac{1}{2(1+\sqrt{n})},$$

with constant risk

$$R(\theta^*, \theta) = \frac{1}{4(1+\sqrt{n})^2}, \quad \text{for all } \theta.$$

We show now that $\theta^*$ is a minimax estimator of $\theta$ for a squared-error loss by specifying a prior beta distribution for which $\theta^*$ is Bayes. The Bayes estimator for the prior $\beta(\nu_1, \nu_2)$ is

$$\hat\theta_{\nu_1,\nu_2} = \frac{\nu_1 + X}{\nu_1 + \nu_2 + n} = \frac{1}{\nu_1 + \nu_2 + n}X + \frac{\nu_1}{\nu_1 + \nu_2 + n}.$$

In particular, if $\nu_1 = \nu_2 = \sqrt{n}/2$ then $\hat\theta_{\nu_1,\nu_2} = \theta^*$. This proves that $\theta^*$ is minimax.

Finally, we compare the MSE of this minimax estimator with the variance of the MVUE, $X/n$, which is also an MLE. The variance of $\hat\theta = X/n$ is $\theta(1-\theta)/n$. $V\{\hat\theta\}$ at $\theta = 1/2$ assumes its maximal value of $1/4n$. This value is larger than $R(\theta^*, \theta)$. Thus, we know that around $\theta = 1/2$ the minimax estimator has a smaller MSE than the MVUE. Actually, by solving the quadratic equation

$$\theta^2 - \theta + n/4(1+\sqrt{n})^2 = 0,$$

we obtain the two limits of the interval around $\theta = 1/2$ over which the minimax estimator is better. These limits are given by

$$\theta_{1,2} = \frac{1}{2}\left(1 \pm \frac{\sqrt{1 + 2\sqrt{n}}}{1 + \sqrt{n}}\right).$$
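The constant-risk property in Example 9.1 and the interval on which the minimax estimator beats $X/n$ can be checked by exact summation over the binomial distribution. The following is a minimal sketch; all function names are illustrative and not taken from the text.

```python
import math

def binomial_pmf(x, n, theta):
    """P{X = x} for X ~ B(n, theta)."""
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)

def exact_mse(estimator, n, theta):
    """Exact mean squared error of estimator(X) at theta."""
    return sum(binomial_pmf(x, n, theta) * (estimator(x) - theta) ** 2
               for x in range(n + 1))

def minimax_estimator(n):
    """theta* = X / (sqrt(n)(1 + sqrt(n))) + 1 / (2(1 + sqrt(n)))."""
    root = math.sqrt(n)
    return lambda x: x / (root * (1.0 + root)) + 1.0 / (2.0 * (1.0 + root))

def crossover_limits(n):
    """Limits of the interval where theta* has smaller MSE than X/n."""
    root = math.sqrt(n)
    half_width = math.sqrt(1.0 + 2.0 * root) / (1.0 + root)
    return 0.5 * (1.0 - half_width), 0.5 * (1.0 + half_width)

if __name__ == "__main__":
    n = 16
    star = minimax_estimator(n)
    for theta in (0.1, 0.5, 0.9):
        print(theta, exact_mse(star, n, theta), exact_mse(lambda x: x / n, n, theta))
    print(crossover_limits(n))
```

For $n = 16$ the constant risk is $1/(4 \cdot 25) = 0.01$ and the limits are $0.2$ and $0.8$, where $\theta(1-\theta)/n$ also equals $0.01$.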
Example 9.2.

A. Let $X_1, \ldots, X_n$ be i.i.d. random variables, distributed like $N(\theta, 1)$, $-\infty < \theta < \infty$. The MVUE (or MLE), $\bar{X}$, has a constant variance $n^{-1}$. Thus, if our loss function is the squared-error, $\bar{X}$ is minimax. Indeed, the estimators $\hat\theta_k = \bar{X}\left(1 + \frac{1}{nk}\right)^{-1}$, $k = 1, 2, \ldots$, are Bayes with respect to the prior $N(0, k)$ distributions. The risks of these Bayesian estimators are

$$\rho(\hat\theta_k, k) = \frac{1}{n}\left(1 + \frac{1}{nk}\right)^{-1}, \quad k = 1, 2, \ldots.$$

But $\rho(\hat\theta_k, k) \to n^{-1}$ as $k \to \infty$. This proves that $\bar{X}$ is minimax.

B. Consider the problem of estimating the common mean, $\mu$, of two normal distributions, which was discussed in Example 5.24. We can show that $(\bar{X} + \bar{Y})/2$ is a minimax estimator for the loss function $(\hat\mu - \mu)^2/\sigma^2\max(1, \rho)$. If the loss function is $(\hat\mu - \mu)^2/\sigma^2$ then the minimax estimator is $\bar{X}$, regardless of $\bar{Y}$. This is due to the large risk when $\rho \to \infty$ (see details in Zacks (1971, p. 291)).


Example 9.3. Consider the problem of estimating the variance components in the Model II of analysis of variance. We have $k$ blocks of $n$ observations on the random variables, which are represented by the linear model

$$Y_{ij} = \mu + a_i + e_{ij}, \quad i = 1, \ldots, k, \quad j = 1, \ldots, n,$$

where $e_{ij}$ are i.i.d. $N(0, \sigma^2)$; $a_1, \ldots, a_k$ are i.i.d. random variables distributed like $N(0, \tau^2)$, independently of $\{e_{ij}\}$. In Example 3.3, we have established that a minimal sufficient statistic is $T = \left(\sum_{i=1}^k\sum_{j=1}^n Y_{ij}^2, \sum_{i=1}^k \bar{Y}_i^2, \bar{Y}\right)$. This minimal sufficient statistic can be represented by $T^* = (Q_e, Q_a, \bar{Y})$, where

$$Q_e = \sum_{i=1}^k\sum_{j=1}^n (Y_{ij} - \bar{Y}_i)^2 \sim \sigma^2\chi_1^2[k(n-1)],$$

$$Q_a = n\sum_{i=1}^k (\bar{Y}_i - \bar{Y})^2 \sim \sigma^2(1 + n\rho)\chi_2^2[k-1],$$

and

$$kn\bar{Y}^2 \sim \sigma^2(1 + n\rho)\chi_3^2\left[1; \frac{nk\mu^2}{2\sigma^2(1 + n\rho)}\right];$$

$\rho = \tau^2/\sigma^2$ is the variance ratio, and $\chi_i^2[\cdot]$, $i = 1, 2, 3$, are three independent chi-squared random variables. Consider the group $G$ of real affine transformations, $G = \{[\alpha, \beta]; -\infty < \alpha < \infty, 0 < \beta < \infty\}$, and the quadratic loss function $L(\hat\theta, \theta) = (\hat\theta - \theta)^2/\theta^2$. We notice that all the parameter points $(\mu, \sigma^2, \tau^2)$ such that $\tau^2/\sigma^2 = \rho$ belong to the same orbit. The values of $\rho$, $0 \le \rho < \infty$, index the various possible orbits in the parameter space. The maximal invariant reduction of $T^*$ is

$$(\bar{Y}, Q_e, Q_a) \mapsto \left(0, 1, \frac{Q_a}{Q_e}\right).$$

Thus, every equivariant estimator of $\sigma^2$ is of the form

$$\hat\sigma_\psi^2 = Q_e\,\psi\left(\frac{Q_a}{Q_e}\right) = \frac{Q_e}{nk+1}(1 + Uf(U)),$$

where $U = Q_a/Q_e$, and $\psi(U)$ and $f(U)$ are chosen functions. Note that the distribution of $U$ depends only on $\rho$. Indeed, $U \sim (1 + n\rho)\chi_2^2[k-1]/\chi_1^2[k(n-1)]$. The risk function of an equivariant estimator $\hat\sigma_\psi^2$ is (Zacks, 1970)

$$R(f, \rho) = \frac{2}{nk+1} + \frac{nk-1}{nk+1}E_\rho\left\{\frac{U^2(1 + n\rho)^2}{(1 + U + n\rho)^2}\left[f(U) - \frac{1}{1 + n\rho}\right]^2\right\}.$$

If $K(\rho)$ is any prior distribution of the variance ratio $\rho$, the prior risk $E_K\{R(f, \rho)\}$ is minimized by choosing $f(U)$ to minimize the posterior expectation given $U$, i.e.,

$$E_{\rho\mid U}\left\{\frac{U^2(1 + n\rho)^2}{(1 + U + n\rho)^2}\left[f(U) - \frac{1}{1 + n\rho}\right]^2\right\}.$$

The function $f_K(U)$ that minimizes this posterior expectation is

$$f_K(U) = \frac{E_{\rho\mid U}\{(1 + n\rho)(1 + U + n\rho)^{-2}\}}{E_{\rho\mid U}\{(1 + n\rho)^2(1 + U + n\rho)^{-2}\}}.$$

The Bayes equivariant estimator of $\sigma^2$ is obtained by substituting $f_K(U)$ in $\hat\sigma_\psi^2$. For more specific results, see Zacks (1970b).


Example 9.4. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a location and scale parameter exponential distribution, i.e.,

$$X_i \sim \mu + \sigma G_i(1, 1), \quad i = 1, \ldots, n,$$

$-\infty < \mu < \infty$, $0 < \sigma < \infty$.

A minimal sufficient statistic is $(X_{(1)}, S_n)$, where $X_{(1)} \le \cdots \le X_{(n)}$ and $S_n = \frac{1}{n-1}\sum_{i=2}^n (X_{(i)} - X_{(1)})$. We can derive the structural distribution on the basis of the minimal sufficient statistic. Recall that $X_{(1)}$ and $S_n$ are independent and

$$X_{(1)} \sim \mu + \sigma G_{(1)} \sim \mu + \sigma G(n, 1),$$

$$S_n \sim \frac{\sigma}{n-1}\sum_{i=2}^n (G_{(i)} - G_{(1)}) \sim \sigma G(n-1, n-1),$$

where $G_{(1)} \le \cdots \le G_{(n)}$ is an order statistic from a standard exponential distribution $G(1, 1)$, corresponding to $\mu = 0$, $\sigma = 1$.

The group of transformations under consideration is $G = \{[a, b]; -\infty < a < \infty, 0 < b < \infty\}$. The standard point is the vector $(G_{(1)}, S_G)$, where $G_{(1)} = (X_{(1)} - \mu)/\sigma$ and $S_G = S_n/\sigma$. The Jacobian of this transformation is $J(X_{(1)}, S_n, \mu, \sigma) = \frac{S_n}{\sigma^3}$. Moreover, the p.d.f. of $(G_{(1)}, S_G)$ is

$$\phi(u, s) = \frac{n(n-1)^{n-1}}{(n-2)!}s^{n-2}\exp\{-nu - (n-1)s\}; \quad 0 \le u < \infty, \ 0 \le s < \infty.$$

Hence, the structural distribution of $(\mu, \sigma)$ given $(X_{(1)}, S_n)$ has the p.d.f.

$$k(\mu, \sigma; X_{(1)}, S_n) = \frac{n(n-1)^{n-1}}{(n-2)!}\frac{S_n^{n-1}}{\sigma^{n+1}}\exp\left\{-n\frac{X_{(1)} - \mu}{\sigma} - (n-1)\frac{S_n}{\sigma}\right\},$$

for $-\infty < \mu \le X_{(1)}$, $0 < \sigma < \infty$.

The minimum risk structural estimators in the present example are obtained in the following manner. Let $L(\hat\mu, \mu, \sigma) = (\hat\mu - \mu)^2/\sigma^2$ be the loss function for estimating $\mu$. Then the minimum risk estimator is the $\mu$-expectation. This is given by

$$\begin{aligned} E\{\mu \mid X_{(1)}, S_n\} &= \frac{n(n-1)^{n-1}}{(n-2)!}S_n^{n-1}\int_0^\infty \frac{1}{\sigma^{n+1}}\exp\left\{-(n-1)\frac{S_n}{\sigma}\right\}\int_{-\infty}^{X_{(1)}} \mu\exp\left\{-n\frac{X_{(1)} - \mu}{\sigma}\right\}d\mu\, d\sigma \\ &= X_{(1)} - \frac{n-1}{n(n-2)}S_n \\ &= X_{(1)} - \frac{1}{n(n-2)}\sum_{i=2}^n (X_{(i)} - X_{(1)}). \end{aligned}$$

It is interesting to notice that while the MLE of $\mu$ is $X_{(1)}$, the minimum risk structural estimator might be considerably smaller, but close to the Pitman estimator.

The minimum risk structural estimator of $\sigma$, for the loss function $L(\hat\sigma, \sigma) = (\hat\sigma - \sigma)^2/\sigma^2$, is given by

$$\hat\sigma = \frac{E\left\{\frac{1}{\sigma} \mid X\right\}}{E\left\{\frac{1}{\sigma^2} \mid X\right\}} = \frac{1/S_n}{\frac{n}{n-1}\cdot\frac{1}{S_n^2}} = \frac{n-1}{n}S_n.$$

One can show that $\hat\sigma$ is also the minimum risk equivariant estimator of $\sigma$.


Example 9.5. A minimax estimator might be inadmissible. We show such a case in the present example. Let $X_1, \ldots, X_n$ be i.i.d. random variables having a normal distribution, like $N(\mu, \sigma^2)$, where both $\mu$ and $\sigma^2$ are unknown. The objective is to estimate $\sigma^2$ with the quadratic loss $L(\hat\sigma^2, \sigma^2) = (\hat\sigma^2 - \sigma^2)^2/\sigma^4$. The best equivariant estimator with respect to the group $G$ of real affine transformations is $\hat\sigma^2 = Q/(n+1)$, where $Q = \sum_{i=1}^n (X_i - \bar{X}_n)^2$. This estimator has a constant risk $R(\hat\sigma^2, \sigma^2) = \frac{2}{n+1}$. Thus, $\hat\sigma^2$ is minimax. However, $\hat\sigma^2$ is dominated uniformly by the estimator (9.3.31) and is thus inadmissible.


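Example 9.6. In continuation of Example 9.5, given the minimal sufficient statistics $(\bar{X}_n, Q)$, the Bayes estimator of $\sigma^2$, with respect to the squared-error loss, and the prior assumption that $\mu$ and $\sigma^2$ are independent, $\mu \sim N(0, \tau^2)$ and $1/2\sigma^2 \sim G(\lambda, \nu)$, is

$$\hat\sigma_{\tau,\lambda,\nu}^2(\bar{X}, Q) = \frac{\int_0^\infty \theta^{\frac{n}{2}+\nu-2}(1 + 2\theta n\tau^2)^{-1/2}\exp\left\{-\frac{\theta n\bar{X}^2}{1 + 2\theta n\tau^2} - \theta(Q + \lambda)\right\}d\theta}{2\int_0^\infty \theta^{\frac{n}{2}+\nu-1}(1 + 2\theta n\tau^2)^{-1/2}\exp\left\{-\frac{\theta n\bar{X}^2}{1 + 2\theta n\tau^2} - \theta(Q + \lambda)\right\}d\theta}.$$

This estimator is admissible since $h(\mu, \sigma^2) > 0$ for all $(\mu, \sigma^2)$ and the risk function is continuous in $(\mu, \sigma^2)$.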


Example 9.7. Let $X_1, \ldots, X_n$ be i.i.d. random variables, having the location parameter exponential density $f(x; \mu) = e^{-(x-\mu)}I\{x \ge \mu\}$. The minimal sufficient statistic is $X_{(1)} = \min_{1 \le i \le n}\{X_i\}$. Moreover, $X_{(1)} \sim \mu + G(n, 1)$. Thus, the UMVU estimator of $\mu$ is $\hat\mu = X_{(1)} - \frac{1}{n}$. According to (9.3.15), this estimator is admissible. Indeed, by Basu's Theorem, the invariant statistic $Y = (X_{(2)} - X_{(1)}, X_{(3)} - X_{(2)}, \ldots, X_{(n)} - X_{(n-1)})$ is independent of $X_{(1)}$. Thus, at $\mu = 0$,

$$(X_{(1)} - E\{X_{(1)} \mid Y\})^2 = \left(X_{(1)} - \frac{1}{n}\right)^2,$$

and

$$E\{[E\{(X_{(1)} - E\{X_{(1)} \mid Y\})^2 \mid Y\}]^{3/2}\} = \left[E\left\{\left(X_{(1)} - \frac{1}{n}\right)^2\right\}\right]^{3/2} = \frac{1}{n^3} < \infty.$$
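Example 9.8. Let $Y \sim N(\theta, I)$, where $\theta = H\beta$,

$$H = \begin{pmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \end{pmatrix}$$

and $\beta' = (\beta_1, \beta_2, \beta_3, \beta_4)$. Note that $\frac{1}{2}H$ is an orthogonal matrix. The LSE of $\beta$ is $\hat\beta = (H'H)^{-1}H'Y = \frac{1}{4}H'Y$, and the LSE of $\theta$ is $\hat\theta = H(H'H)^{-1}H'Y = Y$. The eigenvalues of $H(H'H)^{-1}H'$ are $\alpha_i = 1$, for $i = 1, \ldots, 4$. Since all four eigenvalues are equal to 1, whereas Theorem 9.3.3 allows equality to 1 for at most two of them, $\hat\theta$ is inadmissible.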
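The eigenvalue condition of Theorem 9.3.3 can be checked directly for this design. The following is a minimal sketch: the third row of $H$ is taken to be $(1, -1, 1, -1)$, the row (with leading entry 1) that makes $\frac{1}{2}H$ orthogonal, and the helper names are assumptions made for illustration.

```python
H = [
    [1, -1, -1, 1],
    [1, 1, -1, -1],
    [1, -1, 1, -1],  # assumed row: orthogonal to the other three rows
    [1, 1, 1, 1],
]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def satisfies_cohen_condition(eigenvalues):
    """Eigenvalue part of Theorem 9.3.3: all in [0, 1], at most two equal to 1."""
    in_range = all(0.0 <= a <= 1.0 for a in eigenvalues)
    ones = sum(1 for a in eigenvalues if a == 1.0)
    return in_range and ones <= 2

# H'H = 4I, so H (H'H)^{-1} H' = (1/4) H H'.
gram = matmul(transpose(H), H)
projection = [[v / 4 for v in row] for row in matmul(H, transpose(H))]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    print(gram)
    print(projection == identity)
    # The projection is the identity, so its four eigenvalues are all 1.
    print(satisfies_cohen_condition([1.0, 1.0, 1.0, 1.0]))
```

The projection matrix equals the identity, so all four eigenvalues are 1, which is more than the two allowed by (9.3.30); the check therefore returns `False`.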


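Example 9.9. Let $X \sim N(\theta, I)$. $X$ is a $p$-dimensional vector. We derive the generalized Bayes estimators of $\theta$, for squared-error loss, and the Bayesian hyper-prior (9.3.43). This hyper-prior is

$$\theta \mid \lambda \sim N\left(0, \frac{1-\lambda}{\lambda}I\right), \quad 0 < \lambda \le 1,$$

and

$$dG(\lambda) \propto \lambda^{-a}d\lambda, \quad -\infty < a < \frac{p}{2} + 1.$$

The joint distribution of $(X', \theta')'$, given $\lambda$, is

$$\begin{pmatrix} X \\ \theta \end{pmatrix} \Big| \lambda \sim N\left(0, \begin{pmatrix} \frac{1}{\lambda}I & \frac{1-\lambda}{\lambda}I \\ \frac{1-\lambda}{\lambda}I & \frac{1-\lambda}{\lambda}I \end{pmatrix}\right).$$

Hence, the conditional distribution of $\theta$, given $(X, \lambda)$, is

$$\theta \mid X, \lambda \sim N((1-\lambda)X, (1-\lambda)I).$$

The marginal distribution of $X$, given $\lambda$, is $N\left(0, \frac{1}{\lambda}I\right)$. Thus, the density of the posterior distribution of $\lambda$, given $X$, is

$$h(\lambda \mid X) \propto \lambda^{\frac{p}{2}-a}\exp\left\{-\frac{\lambda}{2}S\right\}, \quad 0 < \lambda \le 1,$$

where $S = X'X$. It follows that the generalized Bayes estimator of $\theta$, given $X$, is

$$\hat\theta_a(X) = (1 - E\{\lambda \mid X\})X,$$

where

$$E\{\lambda \mid X\} = \frac{\int_0^1 \lambda^{\frac{p}{2}-a+1}e^{-\frac{\lambda}{2}S}d\lambda}{\int_0^1 \lambda^{\frac{p}{2}-a}e^{-\frac{\lambda}{2}S}d\lambda} = \frac{p-2a+2}{S}\cdot\frac{P\{G(1, \frac{p}{2}-a+2) \le \frac{S}{2}\}}{P\{G(1, \frac{p}{2}-a+1) \le \frac{S}{2}\}}.$$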

PART II: PROBLEMS

Section 9.1

9.1.1 Consider a finite population of $N$ units. $M$ units have the value $x = 1$ and the rest have the value $x = 0$. A random sample of size $n$ is drawn without replacement. Let $X$ be the number of sample units having the value $x = 1$. The conditional distribution of $X$, given $M$, $n$ is $H(N, M, n)$. Consider the problem of estimating the parameter $P = M/N$, with a squared-error loss. Show that the linear estimator $\hat{P}_{\alpha,\beta} = \alpha\frac{X}{n} + \beta$, with $\alpha = 1\Big/\left(1 + \sqrt{\frac{N-n}{n(N-1)}}\right)$ and $\beta = \frac{1}{2}(1 - \alpha)$, has a constant risk.

9.1.2 Let $\mathcal{H}$ be a family of prior distributions on $\Theta$. The Bayes risk of $H \in \mathcal{H}$ is $\rho(H) = \rho(\hat\theta_H, H)$, where $\hat\theta_H$ is a Bayesian estimator of $\theta$, with respect to $H$. $H^*$ is called least-favorable in $\mathcal{H}$ if $\rho(H^*) = \sup_{H \in \mathcal{H}}\rho(H)$. Prove that if $H$ is a prior distribution in $\mathcal{H}$ such that $\rho(H) = \sup_{\theta \in \Theta} R(\theta, \hat\theta_H)$ then
(i) $\hat\theta_H$ is minimax for $\mathcal{H}$;
(ii) $H$ is least favorable in $\mathcal{H}$.

9.1.3 Let $X_1$ and $X_2$ be independent random variables, $X_1 \mid \theta_1 \sim B(n, \theta_1)$ and $X_2 \mid \theta_2 \sim B(n, \theta_2)$. We wish to estimate $\delta = \theta_2 - \theta_1$.
(i) Show that for the squared-error loss, the risk function $R(\hat\delta, \theta_1, \theta_2)$ of

$$\hat\delta(X_1, X_2) = \frac{\sqrt{2n}}{n(\sqrt{2n} + 1)}(X_2 - X_1)$$

attains its supremum for all points $(\theta_1, \theta_2)$ such that $\theta_1 + \theta_2 = 1$.
(ii) Apply the result of Problem 2 to show that $\hat\delta(X_1, X_2)$ is minimax.

9.1.4 Prove that if $\hat\theta(X)$ is a minimax estimator over $\Theta_1$, where $\Theta_1 \subset \Theta$ and $\sup_{\theta \in \Theta_1} R(\hat\theta, \theta) = \sup_{\theta \in \Theta} R(\hat\theta, \theta)$, then $\hat\theta$ is minimax over $\Theta$.

9.1.5 Let $X_1, \ldots, X_n$ be a random sample (i.i.d.) from a distribution with mean $\theta$ and variance $\sigma^2 = 1$. It is known that $|\theta| \le M$, $0 < M < \infty$. Consider the linear estimator $\hat\theta_{a,b} = a\bar{X}_n + b$, where $\bar{X}_n$ is the sample mean, with $0 \le a \le 1$.
(i) Derive the risk function of $\hat\theta_{a,b}$.
(ii) Show that

$$\sup_{\theta: |\theta| \le M} R(\hat\theta_{a,b}, \theta) = \max\{R(\hat\theta_{a,b}, -M), R(\hat\theta_{a,b}, M)\}.$$

(iii) Show that $\hat\theta_{a^*} = a^*\bar{X}_n$, with $a^* = M^2\Big/\left(M^2 + \frac{1}{n}\right)$, is minimax.


Section 9.2

9.2.1 Consider Problem 4, Section 8.3. Determine the Bayes estimator for $\mu$, $\delta = \mu - \eta$ and $\sigma^2$ with respect to the improper prior $h(\mu, \eta, \sigma^2)$ specified there and the invariant loss functions

$$L(\hat\mu, \mu) = (\hat\mu - \mu)^2/\sigma^2, \quad L(\hat\delta, \delta) = (\hat\delta - \delta)^2/\sigma^2 \quad \text{and} \quad L(\hat\sigma^2, \sigma^2) = (\hat\sigma^2 - \sigma^2)^2/\sigma^4,$$

respectively, and show that the Bayes estimators are equivariant with respect to $G = \{[\alpha, \beta]; -\infty < \alpha < \infty, 0 < \beta < \infty\}$.

9.2.2 Consider Problem 4, Section 8.2. Determine the Bayes equivariant estimator of the variance ratio $\rho$ with respect to the improper prior distribution specified in the problem, the group $G = \{[\alpha, \beta]; -\infty < \alpha < \infty, 0 < \beta < \infty\}$ and the squared-error loss function for $\rho$.

9.2.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a common rectangular distribution $R(\theta, \theta + 1)$, $-\infty < \theta < \infty$. Determine the minimum MSE equivariant estimator of $\theta$ with a squared-error loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$ and the group $G$ of translations.

9.2.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a scale parameter distribution, i.e.,

$$\mathcal{F} = \left\{\frac{1}{\sigma}\phi\left(\frac{x}{\sigma}\right), 0 < \sigma < \infty\right\}.$$

(i) Show that the Pitman estimator of $\sigma$ is the same as the formal Bayes estimator (9.2.14).
(ii) Derive the Pitman estimator of $\sigma$ when $\phi(x) = e^{-x}I\{x \ge 0\}$.


Section 9.3 


9.3.1 Minimax estimators are not always admissible. However, prove that the minimax estimator of $\theta$ in the $B(n, \theta)$ case, with squared-error loss function, is admissible.

9.3.2 Let $X \sim B(n, \theta)$. Show that $\hat\theta = X/n$ is an admissible estimator of $\theta$
(i) for the squared-error loss;
(ii) for the quadratic loss $(\hat\theta - \theta)^2/\theta(1 - \theta)$.

9.3.3 Let $X$ be a discrete random variable having a uniform distribution on $\{0, 1, \ldots, \theta\}$, where the parameter space is $\Theta = \{0, 1, 2, \ldots\}$. For estimating $\theta$, consider the loss function $L(\hat\theta, \theta) = \theta(\hat\theta - \theta)^2$.
(i) Derive the Bayesian estimator $\hat\theta_H$, where $H$ is a discrete prior distribution on $\Theta$.
(ii) What is the Bayesian estimator of $\theta$, when $h(\theta) = e^{-\tau}\frac{\tau^\theta}{\theta!}$, $\theta = 0, 1, \ldots$, $0 < \tau < \infty$?
(iii) Show that $\hat\theta = X$ is a Bayesian estimator only for the prior distribution concentrated on $\theta = 0$, i.e., $P_H\{\theta = 0\} = 1$.
(iv) Compare the risk function of $\hat\theta$ with the risk function of $\hat\theta_1 = \max(1, X)$.
(v) Show that an estimator $\hat\theta_m = m$ is Bayesian against the prior $H_m$ s.t. $P_{H_m}\{\theta = m\} = 1$, $m = 0, 1, 2, \ldots$.
9.3.4 Let $X$ be a random variable (r.v.) with mean $\theta$ and variance $\sigma^2$, $0 < \sigma^2 < \infty$.
Show that $\hat{\theta}_{a,b} = aX + b$ is an inadmissible estimator of $\theta$, for the
squared-error loss function, whenever
(i) $a > 1$, or
(ii) $a < 0$, or
(iii) $a = 1$ and $b \neq 0$.


9.3.5 Let $X_1, \ldots, X_n$ be i.i.d. random variables having a normal distribution with
mean zero and variance $\sigma^2 = 1/\phi$.
(i) Derive the Bayes estimator of $\sigma^2$, for the squared-error loss, and gamma
prior, i.e., $\phi \sim G\left(\frac{\psi}{2}, \frac{n_0}{2}\right)$.
(ii) Show that the risk of the Bayes estimator is finite if $n + n_0 > 4$.
(iii) Use Karlin's Theorem (Theorem 9.3.1) to establish the admissibility of
this Bayes estimator, when $n + n_0 > 4$.


9.3.6 Let $X \sim B(n, \theta)$, $0 < \theta < 1$. For which values of $\lambda$ and $\gamma$ is the linear
estimator

$$\hat{\theta}_{\lambda,\gamma} = \frac{X}{n(1+\lambda)} + \frac{\gamma\lambda}{1+\lambda}$$

admissible?


9.3.7 Prove that if an estimator has a constant risk, and is admissible, then it is
minimax.


9.3.8 Prove that if an estimator is unique minimax then it is admissible.


9.3.9 Suppose that $\mathcal{F} = \{F(x; \theta), \theta \in \Theta\}$ is invariant under a group of
transformations $G$. If $\Theta$ has only one orbit with respect to $G$ (transitive) then the
minimum risk equivariant estimator is minimax and admissible.


9.3.10 Show that any unique Bayesian estimator is admissible.


9.3.11 Let $X \sim N(\theta, I)$ where the dimension of $X$ is $p > 2$. Show that the risk
function of $\hat{\theta}_c = \left(1 - c\,\frac{p-2}{|X|^2}\right) X$, for the squared-error loss, is

$$R(\hat{\theta}_c, \theta) = 1 - \frac{(p-2)^2}{p}\, E_\theta\left\{\frac{c(2-c)}{|X|^2}\right\}.$$

For which values of $c$ does $\hat{\theta}_c$ dominate $\hat{\theta} = X$? (i.e., the risk of $\hat{\theta}_c$ is
uniformly smaller than that of $X$).


9.3.12 Show that the James–Stein estimator (9.3.16) is dominated by

$$\hat{\theta}^{+} = \left(1 - \frac{p-2}{\|X\|^2}\right)^{+} X,$$

where $a^{+} = \max(a, 0)$.


PART IV: SOLUTIONS OF SELECTED PROBLEMS 


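9.1.1

(i) The variance of $X/n$ is $\frac{1}{n} P(1 - P)\left(1 - \frac{n-1}{N-1}\right)$. Thus, the MSE of $\hat{P}_{\alpha,\beta}$
is

$$\mathrm{MSE}\{\hat{P}_{\alpha,\beta}\} = \frac{\alpha^2}{n} P(1 - P)\left(1 - \frac{n-1}{N-1}\right) + (\beta - (1 - \alpha)P)^2.$$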


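Now, for $\hat{\alpha} = \left(1 + \sqrt{\frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)}\right)^{-1}$ and $\hat{\beta} = \frac{1}{2}(1 - \hat{\alpha})$ we get

$$\mathrm{MSE}\{\hat{P}_{\hat{\alpha},\hat{\beta}}\} = \hat{\beta}^2 = \frac{1 - \frac{n-1}{N-1}}{4n\left(1 + \sqrt{\frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)}\right)^2}.$$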


This is a constant risk (independent of P). 
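
The constant-risk claim can be checked numerically. The sketch below assumes the estimator has the form alpha * X/n + beta with X hypergeometric and P = M/N, and it takes alpha = 1/(1 + sqrt(k)), beta = (1 - alpha)/2 with k = (1/n)(1 - (n-1)/(N-1)). These coefficients come from setting the coefficients of P and P^2 in the MSE to zero; they, the function names, and the values n = 10, N = 50 are assumptions of this illustration.

```python
import math


def mse(P, n, N, alpha, beta):
    """MSE of alpha * X/n + beta, using Var(X/n) = k * P * (1 - P)."""
    k = (1.0 - (n - 1) / (N - 1)) / n
    return alpha ** 2 * k * P * (1 - P) + (beta - (1 - alpha) * P) ** 2


def constant_risk_coefficients(n, N):
    """Coefficients that cancel the P and P^2 terms of the MSE (assumed form)."""
    k = (1.0 - (n - 1) / (N - 1)) / n
    alpha = 1.0 / (1.0 + math.sqrt(k))
    beta = (1.0 - alpha) / 2.0
    return alpha, beta


n, N = 10, 50                      # illustrative sample and population sizes
alpha, beta = constant_risk_coefficients(n, N)

# Evaluate the MSE at every admissible value P = M/N, M = 0, ..., N.
risks = [mse(M / N, n, N, alpha, beta) for M in range(N + 1)]
spread = max(risks) - min(risks)   # should vanish up to rounding error
print(f"alpha={alpha:.6f} beta={beta:.6f} spread={spread:.3e}")
```

With these coefficients the MSE reduces to beta^2 for every P, which is what the spread measures.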


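(i)

$$R(\delta, \theta_1, \theta_2) = \frac{2(\theta_1(1 - \theta_1) + \theta_2(1 - \theta_2)) + (\theta_1 - \theta_2)^2}{(1 + \sqrt{2n})^2} = \frac{(\theta_1 + \theta_2)(2 - (\theta_1 + \theta_2))}{(1 + \sqrt{2n})^2}.$$

Let $w = \theta_1 + \theta_2$; $g(w) = 2w - w^2$ attains its maximum at $w = 1$. Thus,
$R(\delta, \theta_1, \theta_2)$ attains its supremum on $\{(\theta_1, \theta_2) : \theta_1 + \theta_2 = 1\}$, which is

$$\rho^*(\delta) = \frac{1}{(1 + \sqrt{2n})^2}.$$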
(ii) Let $\theta_1$, $\theta_2$ be priorly independent, having the same prior distribution
beta$(a, a)$. Then, the Bayes estimator for this prior and squared-error loss
is

$$\delta_H(X_1, X_2) = E\{\theta_1 - \theta_2 \mid X_1, X_2\} = E\{\theta_1 \mid X_1\} - E\{\theta_2 \mid X_2\} = \frac{X_1 + a}{n + 2a} - \frac{X_2 + a}{n + 2a} = \frac{X_1 - X_2}{n\left(1 + \frac{2a}{n}\right)}.$$


Let $a = \frac{n}{2} \cdot \frac{1}{\sqrt{2n}}$, so that $1 + \frac{2a}{n} = 1 + \frac{1}{\sqrt{2n}}$; then the Bayes estimator $\delta_H(X_1, X_2)$ is equal to
$\delta(X_1, X_2)$. Hence,


$$\rho^*(\delta_H) = \sup_{\theta_1, \theta_2} R(\delta, \theta_1, \theta_2) = \rho^*(\delta) = \frac{1}{(1 + \sqrt{2n})^2}.$$

Thus, $\delta(X_1, X_2)$ is minimax.
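
The supremum in part (i) can be illustrated on a grid. The following sketch evaluates the risk expression quoted above over a lattice of (theta1, theta2) values and compares the largest value with 1/(1 + sqrt(2n))^2. The function name, the grid resolution, and n = 8 are assumptions made for illustration only.

```python
import math


def risk(theta1, theta2, n):
    """Risk expression from part (i) of the solution."""
    num = (2 * (theta1 * (1 - theta1) + theta2 * (1 - theta2))
           + (theta1 - theta2) ** 2)
    return num / (1 + math.sqrt(2 * n)) ** 2


n = 8                                       # illustrative sample size
grid = [i / 100 for i in range(101)]        # lattice on [0, 1]

# Largest risk on the lattice, together with the point where it occurs.
sup_risk, t1, t2 = max((risk(a, b, n), a, b) for a in grid for b in grid)
bound = 1 / (1 + math.sqrt(2 * n)) ** 2     # claimed supremum

print(f"sup on grid={sup_risk:.6f} at ({t1}, {t2}); bound={bound:.6f}")
```

The maximizing point lies on the line theta1 + theta2 = 1, in agreement with the argument based on g(w) = 2w - w^2.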


9.1.5 
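(i) $\hat{\theta}_{a,b} = a\bar{X}_n + b$. Hence,

$$R(\hat{\theta}_{a,b}, \theta) = \mathrm{MSE}\{\hat{\theta}_{a,b}\} = (1 - a)^2\left(\theta - \frac{b}{1 - a}\right)^2 + \frac{a^2}{n}.$$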


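(ii) Since $|\theta| \leq M < \infty$,

$$\sup_{\theta : |\theta| \leq M} R(\hat{\theta}_{a,b}, \theta) = \max\{R(\hat{\theta}_{a,b}, M), R(\hat{\theta}_{a,b}, -M)\} = \frac{a^2}{n} + (1 - a)^2\left(M + \frac{|b|}{1 - a}\right)^2.$$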


(iii) For $b = 0$,

$$\sup_{|\theta| \leq M} R(\hat{\theta}_{a,0}, \theta) = \frac{a^2}{n} + (1 - a)^2 M^2.$$
|0|<M Ht 
The value of $a$ that minimizes this supremum is $a^* = M^2/(M^2 + 1/n)$.
Thus, if $a = a^*$ and $b = 0$,

$$\sup_{|\theta| \leq M} R(\hat{\theta}_{a^*,0}, \theta) = \inf_{(a,b)} \sup_{|\theta| \leq M} R(\hat{\theta}_{a,b}, \theta).$$

Thus, $\hat{\theta}^* = a^*\bar{X}$ is minimax.
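
The minimization over linear estimators can be sketched numerically. The code below assumes unit variance for each observation (as the form of a* suggests), computes the worst-case risk of a * Xbar + b over |theta| <= M, and compares the value at (a*, 0) with a grid search. The function name, the grid, and the values M = 2, n = 5 are illustrative assumptions.

```python
def sup_risk(a, b, M, n):
    """Worst-case risk of a * Xbar + b over |theta| <= M (unit variance)."""
    # The risk a^2/n + ((a - 1) * theta + b)^2 is convex in theta,
    # so its supremum over [-M, M] is attained at an endpoint.
    def risk(theta):
        return a ** 2 / n + ((a - 1) * theta + b) ** 2
    return max(risk(M), risk(-M))


M, n = 2.0, 5                                  # illustrative bound and sample size
a_star = M ** 2 / (M ** 2 + 1 / n)
best = sup_risk(a_star, 0.0, M, n)

a_grid = [i / 200 for i in range(201)]         # a in [0, 1]
b_grid = [j / 50 - 1 for j in range(101)]      # b in [-1, 1]
grid_min = min(sup_risk(a, b, M, n) for a in a_grid for b in b_grid)

print(f"a*={a_star:.6f} sup risk at (a*, 0)={best:.6f} grid minimum={grid_min:.6f}")
```

No grid point does better than (a*, 0), and the worst-case risk there equals a*/n.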
9.3.3 
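$\Theta = \{0, 1, 2, \ldots\}$, $L(\hat{\theta}, \theta) = \theta(\hat{\theta} - \theta)^2$.
(i) $f(x; \theta) = \frac{1}{1 + \theta} I\{x \in (0, 1, \ldots, \theta)\}$. Let $h(\theta)$ be a discrete p.d.f. on $\Theta$.
The posterior p.d.f. of $\theta$, given $X = x$, is

$$h(\theta \mid x) = \frac{\frac{h(\theta)}{1 + \theta}\, I\{\theta \geq x\}}{\sum_{j=x}^{\infty} \frac{h(j)}{1 + j}}.$$

The posterior risk is

$$E_H\{\theta(\hat{\theta} - \theta)^2 \mid X\} = E\{\hat{\theta}^2(X)\theta - 2\theta^2\hat{\theta}(X) + \theta^3 \mid X\}.$$

Thus, the Bayes estimator is the integer part of

$$\frac{E\{\theta^2 \mid x\}}{E\{\theta \mid x\}} = \frac{\sum_{j=x}^{\infty} \frac{j^2}{1 + j}\, h(j)}{\sum_{j=x}^{\infty} \frac{j}{1 + j}\, h(j)}.$$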


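(ii) If $h(j) = e^{-\tau}\frac{\tau^j}{j!}$, $j = 0, 1, \ldots$, the Bayesian estimator is the integer
part of

$$B_H(x) = \frac{\sum_{j=x}^{\infty} \frac{j^2}{1 + j} \cdot \frac{\tau^j}{j!}}{\sum_{j=x}^{\infty} \frac{j}{1 + j} \cdot \frac{\tau^j}{j!}}.$$
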
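Let $p(i; \tau)$ and $P(i; \tau)$ denote, respectively, the p.d.f. and c.d.f. of the
Poisson distribution with mean $\tau$. We have,

$$e^{-\tau} \sum_{j=x}^{\infty} \frac{j}{1 + j} \cdot \frac{\tau^j}{j!} = 1 - P(x - 1; \tau) - \frac{1}{\tau}(1 - P(x; \tau)).$$

Similarly,

$$e^{-\tau} \sum_{j=x}^{\infty} \frac{j^2}{1 + j} \cdot \frac{\tau^j}{j!} = \tau(1 - P(x - 2; \tau)) - (1 - P(x - 1; \tau)) + \frac{1}{\tau}(1 - P(x; \tau)).$$

Thus,

$$B_H(x) = \frac{\tau(1 - P(x - 2; \tau)) - (1 - P(x - 1; \tau)) + \frac{1}{\tau}(1 - P(x; \tau))}{1 - P(x - 1; \tau) - \frac{1}{\tau}(1 - P(x; \tau))}.$$

Note that $B_H(x) > x$ for all $x \geq 1$. Indeed,

$$\sum_{j=x}^{\infty} \frac{j^2}{1 + j} \cdot \frac{\tau^j}{j!} > x \sum_{j=x}^{\infty} \frac{j}{1 + j} \cdot \frac{\tau^j}{j!}$$

for all $x \geq 1$.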
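
As a check on the Poisson-prior formulas in part (ii), the sketch below computes the ratio B_H(x) in two ways: from the closed form in terms of Poisson c.d.f. values, and from a truncated version of the defining series. The function names, the truncation at 200 terms, and the choice tau = 3 are assumptions of this illustration; only the ratio is computed, not its integer part.

```python
import math


def poisson_cdf(x, tau):
    """P(x; tau) = P{J <= x} for J ~ Poisson(tau); equal to 0 for x < 0."""
    if x < 0:
        return 0.0
    return sum(math.exp(-tau) * tau ** j / math.factorial(j)
               for j in range(x + 1))


def b_closed(x, tau):
    """B_H(x) written with Poisson tail probabilities."""
    s0 = 1 - poisson_cdf(x, tau)
    s1 = 1 - poisson_cdf(x - 1, tau)
    s2 = 1 - poisson_cdf(x - 2, tau)
    return (tau * s2 - s1 + s0 / tau) / (s1 - s0 / tau)


def b_series(x, tau, terms=200):
    """B_H(x) from the defining series, truncated after `terms` terms."""
    num = den = 0.0
    for j in range(x, x + terms):
        # Poisson probability of j, divided by (1 + j); log scale avoids overflow.
        w = math.exp(-tau + j * math.log(tau) - math.lgamma(j + 1)) / (1 + j)
        num += j * j * w
        den += j * w
    return num / den


tau = 3.0
for x in range(7):
    print(x, round(b_closed(x, tau), 6), round(b_series(x, tau), 6))
```

The two computations should agree up to rounding, and the values for x >= 1 exceed x, in line with the note above.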
(iii) If $P\{\theta = 0\} = 1$ then obviously $P\{X = 0\} = 1$ and $B_H(X) = 0 = X$. On
the other hand, suppose that for some $j > 0$, $P\{\theta = j\} > 0$; then
$P\{X = j \mid \theta = j\} = \frac{1}{1 + j}$ and, with probability $\frac{1}{1 + j}$, $B_H(X) > X$. Thus, $B_H(X) =
X = 0$ if, and only if, $P\{\theta = 0\} = 1$.
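(iv) Let $\hat{\theta} = X$. The risk is then

$$R(\hat{\theta}, \theta) = E_\theta\{(X - \theta)^2\} = \frac{1}{1 + \theta} \sum_{j=0}^{\theta} (\theta - j)^2 = \frac{\theta(2\theta + 1)}{6}.$$

Note that $R(\hat{\theta}, 0) = 0$. Consider the estimator $\hat{\theta}_1 = \max(1, X)$. In this
case, $R(\hat{\theta}_1, 0) = 1$. We can show that $R(\hat{\theta}_1, \theta) < R(\hat{\theta}, \theta)$ for all $\theta \geq 1$.
However, since $R(\hat{\theta}, 0) < R(\hat{\theta}_1, 0)$, $\hat{\theta}_1$ is not better than $\hat{\theta}$.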
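
The comparison in part (iv) can be verified by exact enumeration. The sketch below follows the displayed risk, that is, the expected squared error E{(estimate - theta)^2} with X uniform on {0, ..., theta}, and uses exact rational arithmetic. The function names and the range theta = 0, ..., 20 are assumptions for illustration.

```python
from fractions import Fraction


def risk_x(theta):
    """Squared-error risk of X when X is uniform on {0, ..., theta}."""
    return sum(Fraction((x - theta) ** 2, theta + 1)
               for x in range(theta + 1))


def risk_max1(theta):
    """Squared-error risk of max(1, X) under the same model."""
    return sum(Fraction((max(1, x) - theta) ** 2, theta + 1)
               for x in range(theta + 1))


for theta in range(6):
    print(theta, risk_x(theta), risk_max1(theta))
```

Neither estimator dominates the other: max(1, X) has the smaller risk for every theta >= 1, while X has the smaller risk at theta = 0.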